Data Science Project
Title of dataset: “Kepler Exoplanet Search Results”
Source: Kaggle - https://www.kaggle.com/nasa/kepler-exoplanet-search-results
Authors: Matthew Bazzo, Soo Hyung Choe, Shiming Yan, Alex Zhang
The dataset contains 9,564 observations and 50 variables (12 categorical and 38 numerical), totaling over 478,000 data points. Some of the key variables are: the disposition (the status of each observation: confirmed, false positive, or candidate), the koi score, four koi flags (four tests used to determine the validity of a planet), planetary data, and features of the star the planet revolves around. The data were collected by the Kepler telescope, which observes a star and measures the changes in brightness as an object passes between the star and the telescope.
We posed many different questions and applied supervised and unsupervised learning to the dataset in order to answer them.
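As a rough illustration of the brightness measurement described above, the fractional dip during a transit is approximately the square of the planet-to-star radius ratio. The sketch below applies that approximation to the first row of the dataset (koi_prad = 2.26, koi_srad = 0.927). The radius constants, the function name, and the unit assumptions (koi_prad in Earth radii, koi_srad in solar radii, koi_depth in parts per million) are ours. The formula also ignores limb darkening and transit geometry, so it is not expected to reproduce the catalogue depth exactly.

```r
# Approximate transit depth: depth ~ (R_planet / R_star)^2
# Nominal radii in km (assumptions for this sketch, not taken from the dataset).
R_EARTH_KM <- 6371
R_SUN_KM <- 695700

# Hypothetical helper: returns the estimated depth in parts per million.
transit_depth_ppm <- function(planet_radius_earth, star_radius_sun) {
  ratio <- (planet_radius_earth * R_EARTH_KM) / (star_radius_sun * R_SUN_KM)
  ratio^2 * 1e6
}

# First row of the dataset: koi_prad = 2.26, koi_srad = 0.927
depth_est <- transit_depth_ppm(2.26, 0.927)
print(round(depth_est))
```

The estimate comes out at roughly 500 ppm, the same order of magnitude as the koi_depth of 615.8 listed for that row.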
“The Data Science Process is about observation, model building, analysis and conclusion”
We thoroughly followed the data science process as shown below:
1. Ask questions and identify the problem
2. Data Collection
3. Data Exploration
4. Data Modeling
5. Data Analysis
6. Visualization and Presentation of Result
After looking at the data, we thought of many different questions and the problems we wanted to tackle. We separated them into three categories depending on the technique that could be applied to the problem: exploratory data analysis (EDA), supervised learning, and unsupervised learning.
Exploratory Data Analysis
These questions are answered by basic visualizations and/or descriptive statistics.
Supervised Learning
These questions call for supervised learning techniques to develop predictive models.
Unsupervised Learning
These questions, or problems, invoke the use of unsupervised learning techniques to devise labels for observations.
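# stringsAsFactors = TRUE keeps text columns as factors (the default before R 4.0),
# which the levels() calls and factor summaries below rely on.
starData <- read.csv("cumulative.csv", header = TRUE, na.strings = "", stringsAsFactors = TRUE)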
kepler_df <- starData
head(starData)
## rowid kepid kepoi_name kepler_name koi_disposition koi_pdisposition
## 1 1 10797460 K00752.01 Kepler-227 b CONFIRMED CANDIDATE
## 2 2 10797460 K00752.02 Kepler-227 c CONFIRMED CANDIDATE
## 3 3 10811496 K00753.01 <NA> FALSE POSITIVE FALSE POSITIVE
## 4 4 10848459 K00754.01 <NA> FALSE POSITIVE FALSE POSITIVE
## 5 5 10854555 K00755.01 Kepler-664 b CONFIRMED CANDIDATE
## 6 6 10872983 K00756.01 Kepler-228 d CONFIRMED CANDIDATE
## koi_score koi_fpflag_nt koi_fpflag_ss koi_fpflag_co koi_fpflag_ec
## 1 1.000 0 0 0 0
## 2 0.969 0 0 0 0
## 3 0.000 0 1 0 0
## 4 0.000 0 1 0 0
## 5 1.000 0 0 0 0
## 6 1.000 0 0 0 0
## koi_period koi_period_err1 koi_period_err2 koi_time0bk koi_time0bk_err1
## 1 9.488036 2.775e-05 -2.775e-05 170.5387 0.002160
## 2 54.418383 2.479e-04 -2.479e-04 162.5138 0.003520
## 3 19.899140 1.494e-05 -1.494e-05 175.8503 0.000581
## 4 1.736952 2.630e-07 -2.630e-07 170.3076 0.000115
## 5 2.525592 3.761e-06 -3.761e-06 171.5956 0.001130
## 6 11.094321 2.036e-05 -2.036e-05 171.2012 0.001410
## koi_time0bk_err2 koi_impact koi_impact_err1 koi_impact_err2 koi_duration
## 1 -0.002160 0.146 0.318 -0.146 2.95750
## 2 -0.003520 0.586 0.059 -0.443 4.50700
## 3 -0.000581 0.969 5.126 -0.077 1.78220
## 4 -0.000115 1.276 0.115 -0.092 2.40641
## 5 -0.001130 0.701 0.235 -0.478 1.65450
## 6 -0.001410 0.538 0.030 -0.428 4.59450
## koi_duration_err1 koi_duration_err2 koi_depth koi_depth_err1
## 1 0.08190 -0.08190 615.8 19.5
## 2 0.11600 -0.11600 874.8 35.5
## 3 0.03410 -0.03410 10829.0 171.0
## 4 0.00537 -0.00537 8079.2 12.8
## 5 0.04200 -0.04200 603.3 16.9
## 6 0.06100 -0.06100 1517.5 24.2
## koi_depth_err2 koi_prad koi_prad_err1 koi_prad_err2 koi_teq koi_teq_err1
## 1 -19.5 2.26 0.26 -0.15 793 NA
## 2 -35.5 2.83 0.32 -0.19 443 NA
## 3 -171.0 14.60 3.92 -1.31 638 NA
## 4 -12.8 33.46 8.50 -2.83 1395 NA
## 5 -16.9 2.75 0.88 -0.35 1406 NA
## 6 -24.2 3.90 1.27 -0.42 835 NA
## koi_teq_err2 koi_insol koi_insol_err1 koi_insol_err2 koi_model_snr
## 1 NA 93.59 29.45 -16.65 35.8
## 2 NA 9.11 2.87 -1.62 25.8
## 3 NA 39.30 31.04 -10.49 76.3
## 4 NA 891.96 668.95 -230.35 505.6
## 5 NA 926.16 874.33 -314.24 40.9
## 6 NA 114.81 112.85 -36.70 66.5
## koi_tce_plnt_num koi_tce_delivname koi_steff koi_steff_err1
## 1 1 q1_q17_dr25_tce 5455 81
## 2 2 q1_q17_dr25_tce 5455 81
## 3 1 q1_q17_dr25_tce 5853 158
## 4 1 q1_q17_dr25_tce 5805 157
## 5 1 q1_q17_dr25_tce 6031 169
## 6 1 q1_q17_dr25_tce 6046 189
## koi_steff_err2 koi_slogg koi_slogg_err1 koi_slogg_err2 koi_srad
## 1 -81 4.467 0.064 -0.096 0.927
## 2 -81 4.467 0.064 -0.096 0.927
## 3 -176 4.544 0.044 -0.176 0.868
## 4 -174 4.564 0.053 -0.168 0.791
## 5 -211 4.438 0.070 -0.210 1.046
## 6 -232 4.486 0.054 -0.229 0.972
## koi_srad_err1 koi_srad_err2 ra dec koi_kepmag
## 1 0.105 -0.061 291.9342 48.14165 15.347
## 2 0.105 -0.061 291.9342 48.14165 15.347
## 3 0.233 -0.078 297.0048 48.13413 15.436
## 4 0.201 -0.067 285.5346 48.28521 15.597
## 5 0.334 -0.133 288.7549 48.22620 15.509
## 6 0.315 -0.105 296.2861 48.22467 15.714
Load in all the required libraries.
library(Amelia)
## Warning: package 'Amelia' was built under R version 3.4.4
## Loading required package: Rcpp
## ##
## ## Amelia II: Multiple Imputation
## ## (Version 1.7.4, built: 2015-12-05)
## ## Copyright (C) 2005-2018 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
library(ggplot2)
library(ggthemes)
## Warning: package 'ggthemes' was built under R version 3.4.4
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 3.4.4
library(rpart)
library(rpart.plot)
library(corrplot)
## corrplot 0.84 loaded
library(plotly)
## Warning: package 'plotly' was built under R version 3.4.4
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:gridExtra':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
library(class)
library(e1071)
library(neuralnet)
## Warning: package 'neuralnet' was built under R version 3.4.4
First, we analyzed the relationship between koi_disposition and koi_pdisposition to identify their similarities and differences.
levels(starData$koi_disposition)
## [1] "CANDIDATE" "CONFIRMED" "FALSE POSITIVE"
levels(starData$koi_pdisposition)
## [1] "CANDIDATE" "FALSE POSITIVE"
koi_disposition has three categories and koi_pdisposition has two. The two dispositions share the categories “CANDIDATE” and “FALSE POSITIVE”; koi_disposition has an additional category named “CONFIRMED”.
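The comparison of category labels can also be done programmatically. The following is a minimal sketch on two small made-up factors (the vectors and variable names are illustrative, not the real columns), using base R set operations on the factor levels.

```r
# Toy factors standing in for the two disposition columns (illustrative values only).
disposition <- factor(c("CONFIRMED", "CANDIDATE", "FALSE POSITIVE"))
pdisposition <- factor(c("CANDIDATE", "FALSE POSITIVE", "CANDIDATE"))

# Levels present in both columns
shared <- intersect(levels(disposition), levels(pdisposition))

# Levels that appear only in the first column
only_in_disposition <- setdiff(levels(disposition), levels(pdisposition))

print(shared)
print(only_in_disposition)
```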
Data1 <- subset(starData, koi_disposition =="CONFIRMED" & koi_pdisposition =="CANDIDATE")
Data2 <- subset(starData, koi_disposition =="CONFIRMED" & koi_pdisposition =="FALSE POSITIVE")
Data3 <- subset(starData, koi_disposition =="CANDIDATE" & koi_pdisposition =="CANDIDATE")
Data4 <- subset(starData, koi_disposition =="CANDIDATE" & koi_pdisposition =="FALSE POSITIVE")
Data5 <- subset(starData, koi_disposition =="FALSE POSITIVE" & koi_pdisposition =="CANDIDATE")
Data6 <- subset(starData, koi_disposition =="FALSE POSITIVE" & koi_pdisposition =="FALSE POSITIVE")
Out of 9564 rows:
When koi_disposition is Confirmed (2293 rows), koi_pdisposition is Candidate 2248 times and False Positive 45 times, a mismatch rate of about 2% (45 of 2293).
When koi_disposition is Candidate (2248 rows), koi_pdisposition is Candidate 2248 times and False Positive 0 times, a mismatch rate of 0%.
When koi_disposition is False Positive (5023 rows), koi_pdisposition is Candidate 0 times and False Positive 5023 times, a mismatch rate of 0%.
As seen in our analysis, the koi_disposition and koi_pdisposition are very similar. There are small discrepancies which are only present when koi_disposition is classified as Confirmed.
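The pairwise subsets above can also be summarized in a single contingency table. The sketch below uses a small made-up data frame (the rows are illustrative, not the real data) and base R's table() and prop.table() to get the counts and the row-wise shares in one step.

```r
# Toy data standing in for the two disposition columns (illustrative values only).
toy <- data.frame(
  koi_disposition = factor(
    c("CONFIRMED", "CONFIRMED", "CONFIRMED", "CANDIDATE", "FALSE POSITIVE", "FALSE POSITIVE"),
    levels = c("CANDIDATE", "CONFIRMED", "FALSE POSITIVE")
  ),
  koi_pdisposition = factor(
    c("CANDIDATE", "CANDIDATE", "FALSE POSITIVE", "CANDIDATE", "FALSE POSITIVE", "FALSE POSITIVE"),
    levels = c("CANDIDATE", "FALSE POSITIVE")
  )
)

# Counts for every combination of the two labels
cross_tab <- table(toy$koi_disposition, toy$koi_pdisposition,
                   dnn = c("koi_disposition", "koi_pdisposition"))

# Share of each koi_pdisposition label within each koi_disposition row
row_share <- prop.table(cross_tab, margin = 1)

print(cross_tab)
print(round(row_share, 2))
```

On the real data, the same two calls applied to starData would replace the six subset() calls and the manual rate calculation.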
Next, we plot a histogram of koi_score to better understand its distribution:
hist(starData$koi_score)
As seen in the histogram above, the most frequent scores are located at 0 and 1. The remaining koi_score values make up a small percentage of the results and can be rounded to the nearest integer, either 0 or 1.
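The claim about scores clustering at the extremes can be checked numerically. The sketch below uses a short made-up score vector (not the real column) and an arbitrary 0.1 tolerance of our choosing. It measures the share of non-missing scores near 0 or 1 and then rounds each score to 0 or 1.

```r
# Toy scores standing in for koi_score, including one missing value (illustrative only).
scores <- c(0, 0, 1, 0.969, 0.334, NA, 1, 0.02)

# Share of non-missing scores within 0.1 of either extreme (tolerance is an assumption)
near_extreme <- mean(scores <= 0.1 | scores >= 0.9, na.rm = TRUE)

# Round each score to the nearest integer; NA stays NA
rounded <- round(scores)

print(near_extreme)
print(rounded)
```

Missing scores are left as NA rather than forced to a class, which matters here because koi_score has many missing values.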
Analyzing the different column names of the Kepler dataset
titleLabels <- names(starData)
titleLabels
## [1] "rowid" "kepid" "kepoi_name"
## [4] "kepler_name" "koi_disposition" "koi_pdisposition"
## [7] "koi_score" "koi_fpflag_nt" "koi_fpflag_ss"
## [10] "koi_fpflag_co" "koi_fpflag_ec" "koi_period"
## [13] "koi_period_err1" "koi_period_err2" "koi_time0bk"
## [16] "koi_time0bk_err1" "koi_time0bk_err2" "koi_impact"
## [19] "koi_impact_err1" "koi_impact_err2" "koi_duration"
## [22] "koi_duration_err1" "koi_duration_err2" "koi_depth"
## [25] "koi_depth_err1" "koi_depth_err2" "koi_prad"
## [28] "koi_prad_err1" "koi_prad_err2" "koi_teq"
## [31] "koi_teq_err1" "koi_teq_err2" "koi_insol"
## [34] "koi_insol_err1" "koi_insol_err2" "koi_model_snr"
## [37] "koi_tce_plnt_num" "koi_tce_delivname" "koi_steff"
## [40] "koi_steff_err1" "koi_steff_err2" "koi_slogg"
## [43] "koi_slogg_err1" "koi_slogg_err2" "koi_srad"
## [46] "koi_srad_err1" "koi_srad_err2" "ra"
## [49] "dec" "koi_kepmag"
First, obtain a visual of the blank/missing/empty data:
head(kepler_df)
## rowid kepid kepoi_name kepler_name koi_disposition koi_pdisposition
## 1 1 10797460 K00752.01 Kepler-227 b CONFIRMED CANDIDATE
## 2 2 10797460 K00752.02 Kepler-227 c CONFIRMED CANDIDATE
## 3 3 10811496 K00753.01 <NA> FALSE POSITIVE FALSE POSITIVE
## 4 4 10848459 K00754.01 <NA> FALSE POSITIVE FALSE POSITIVE
## 5 5 10854555 K00755.01 Kepler-664 b CONFIRMED CANDIDATE
## 6 6 10872983 K00756.01 Kepler-228 d CONFIRMED CANDIDATE
## koi_score koi_fpflag_nt koi_fpflag_ss koi_fpflag_co koi_fpflag_ec
## 1 1.000 0 0 0 0
## 2 0.969 0 0 0 0
## 3 0.000 0 1 0 0
## 4 0.000 0 1 0 0
## 5 1.000 0 0 0 0
## 6 1.000 0 0 0 0
## koi_period koi_period_err1 koi_period_err2 koi_time0bk koi_time0bk_err1
## 1 9.488036 2.775e-05 -2.775e-05 170.5387 0.002160
## 2 54.418383 2.479e-04 -2.479e-04 162.5138 0.003520
## 3 19.899140 1.494e-05 -1.494e-05 175.8503 0.000581
## 4 1.736952 2.630e-07 -2.630e-07 170.3076 0.000115
## 5 2.525592 3.761e-06 -3.761e-06 171.5956 0.001130
## 6 11.094321 2.036e-05 -2.036e-05 171.2012 0.001410
## koi_time0bk_err2 koi_impact koi_impact_err1 koi_impact_err2 koi_duration
## 1 -0.002160 0.146 0.318 -0.146 2.95750
## 2 -0.003520 0.586 0.059 -0.443 4.50700
## 3 -0.000581 0.969 5.126 -0.077 1.78220
## 4 -0.000115 1.276 0.115 -0.092 2.40641
## 5 -0.001130 0.701 0.235 -0.478 1.65450
## 6 -0.001410 0.538 0.030 -0.428 4.59450
## koi_duration_err1 koi_duration_err2 koi_depth koi_depth_err1
## 1 0.08190 -0.08190 615.8 19.5
## 2 0.11600 -0.11600 874.8 35.5
## 3 0.03410 -0.03410 10829.0 171.0
## 4 0.00537 -0.00537 8079.2 12.8
## 5 0.04200 -0.04200 603.3 16.9
## 6 0.06100 -0.06100 1517.5 24.2
## koi_depth_err2 koi_prad koi_prad_err1 koi_prad_err2 koi_teq koi_teq_err1
## 1 -19.5 2.26 0.26 -0.15 793 NA
## 2 -35.5 2.83 0.32 -0.19 443 NA
## 3 -171.0 14.60 3.92 -1.31 638 NA
## 4 -12.8 33.46 8.50 -2.83 1395 NA
## 5 -16.9 2.75 0.88 -0.35 1406 NA
## 6 -24.2 3.90 1.27 -0.42 835 NA
## koi_teq_err2 koi_insol koi_insol_err1 koi_insol_err2 koi_model_snr
## 1 NA 93.59 29.45 -16.65 35.8
## 2 NA 9.11 2.87 -1.62 25.8
## 3 NA 39.30 31.04 -10.49 76.3
## 4 NA 891.96 668.95 -230.35 505.6
## 5 NA 926.16 874.33 -314.24 40.9
## 6 NA 114.81 112.85 -36.70 66.5
## koi_tce_plnt_num koi_tce_delivname koi_steff koi_steff_err1
## 1 1 q1_q17_dr25_tce 5455 81
## 2 2 q1_q17_dr25_tce 5455 81
## 3 1 q1_q17_dr25_tce 5853 158
## 4 1 q1_q17_dr25_tce 5805 157
## 5 1 q1_q17_dr25_tce 6031 169
## 6 1 q1_q17_dr25_tce 6046 189
## koi_steff_err2 koi_slogg koi_slogg_err1 koi_slogg_err2 koi_srad
## 1 -81 4.467 0.064 -0.096 0.927
## 2 -81 4.467 0.064 -0.096 0.927
## 3 -176 4.544 0.044 -0.176 0.868
## 4 -174 4.564 0.053 -0.168 0.791
## 5 -211 4.438 0.070 -0.210 1.046
## 6 -232 4.486 0.054 -0.229 0.972
## koi_srad_err1 koi_srad_err2 ra dec koi_kepmag
## 1 0.105 -0.061 291.9342 48.14165 15.347
## 2 0.105 -0.061 291.9342 48.14165 15.347
## 3 0.233 -0.078 297.0048 48.13413 15.436
## 4 0.201 -0.067 285.5346 48.28521 15.597
## 5 0.334 -0.133 288.7549 48.22620 15.509
## 6 0.315 -0.105 296.2861 48.22467 15.714
summary(kepler_df)
## rowid kepid kepoi_name kepler_name
## Min. : 1 Min. : 757450 K00001.01: 1 Kepler-1 b : 1
## 1st Qu.:2392 1st Qu.: 5556034 K00002.01: 1 Kepler-10 b : 1
## Median :4782 Median : 7906892 K00003.01: 1 Kepler-10 c : 1
## Mean :4782 Mean : 7690628 K00004.01: 1 Kepler-100 b: 1
## 3rd Qu.:7173 3rd Qu.: 9873066 K00005.01: 1 Kepler-100 c: 1
## Max. :9564 Max. :12935144 K00005.02: 1 (Other) :2289
## (Other) :9558 NA's :7270
## koi_disposition koi_pdisposition koi_score
## CANDIDATE :2248 CANDIDATE :4496 Min. :0.0000
## CONFIRMED :2293 FALSE POSITIVE:5068 1st Qu.:0.0000
## FALSE POSITIVE:5023 Median :0.3340
## Mean :0.4808
## 3rd Qu.:0.9980
## Max. :1.0000
## NA's :1510
## koi_fpflag_nt koi_fpflag_ss koi_fpflag_co koi_fpflag_ec
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00
## Median :0.0000 Median :0.0000 Median :0.0000 Median :0.00
## Mean :0.1882 Mean :0.2316 Mean :0.1949 Mean :0.12
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.00
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.00
##
## koi_period koi_period_err1 koi_period_err2 koi_time0bk
## Min. : 0.24 Min. :0.0000 Min. :-0.1725 Min. : 120.5
## 1st Qu.: 2.73 1st Qu.:0.0000 1st Qu.:-0.0003 1st Qu.: 132.8
## Median : 9.75 Median :0.0000 Median : 0.0000 Median : 137.2
## Mean : 75.67 Mean :0.0021 Mean :-0.0021 Mean : 166.2
## 3rd Qu.: 40.72 3rd Qu.:0.0003 3rd Qu.: 0.0000 3rd Qu.: 170.7
## Max. :129995.78 Max. :0.1725 Max. : 0.0000 Max. :1472.5
## NA's :454 NA's :454
## koi_time0bk_err1 koi_time0bk_err2 koi_impact koi_impact_err1
## Min. :0.0000 Min. :-0.5690 Min. : 0.0000 Min. : 0.000
## 1st Qu.:0.0012 1st Qu.:-0.0105 1st Qu.: 0.1970 1st Qu.: 0.040
## Median :0.0041 Median :-0.0041 Median : 0.5370 Median : 0.193
## Mean :0.0099 Mean :-0.0099 Mean : 0.7351 Mean : 1.960
## 3rd Qu.:0.0105 3rd Qu.:-0.0012 3rd Qu.: 0.8890 3rd Qu.: 0.378
## Max. :0.5690 Max. : 0.0000 Max. :100.8060 Max. :85.540
## NA's :454 NA's :454 NA's :363 NA's :454
## koi_impact_err2 koi_duration koi_duration_err1 koi_duration_err2
## Min. :-59.3200 Min. : 0.052 Min. : 0.0000 Min. :-20.2000
## 1st Qu.: -0.4450 1st Qu.: 2.438 1st Qu.: 0.0508 1st Qu.: -0.3500
## Median : -0.2070 Median : 3.793 Median : 0.1420 Median : -0.1420
## Mean : -0.3326 Mean : 5.622 Mean : 0.3399 Mean : -0.3399
## 3rd Qu.: -0.0460 3rd Qu.: 6.277 3rd Qu.: 0.3500 3rd Qu.: -0.0508
## Max. : 0.0000 Max. :138.540 Max. :20.2000 Max. : 0.0000
## NA's :454 NA's :454 NA's :454
## koi_depth koi_depth_err1 koi_depth_err2
## Min. : 0.0 Min. : 0.0 Min. :-388600.0
## 1st Qu.: 159.9 1st Qu.: 9.6 1st Qu.: -49.5
## Median : 421.1 Median : 20.8 Median : -20.8
## Mean : 23791.3 Mean : 123.2 Mean : -123.2
## 3rd Qu.: 1473.4 3rd Qu.: 49.5 3rd Qu.: -9.6
## Max. :1541400.0 Max. :388600.0 Max. : 0.0
## NA's :363 NA's :454 NA's :454
## koi_prad koi_prad_err1 koi_prad_err2
## Min. : 0.08 Min. : 0.00 Min. :-77180.00
## 1st Qu.: 1.40 1st Qu.: 0.23 1st Qu.: -1.94
## Median : 2.39 Median : 0.52 Median : -0.30
## Mean : 102.89 Mean : 17.66 Mean : -33.02
## 3rd Qu.: 14.93 3rd Qu.: 2.32 3rd Qu.: -0.14
## Max. :200346.00 Max. :21640.00 Max. : 0.00
## NA's :363 NA's :363 NA's :363
## koi_teq koi_teq_err1 koi_teq_err2 koi_insol
## Min. : 25 Mode:logical Mode:logical Min. : 0
## 1st Qu.: 539 NA's:9564 NA's:9564 1st Qu.: 20
## Median : 878 Median : 142
## Mean : 1085 Mean : 7746
## 3rd Qu.: 1379 3rd Qu.: 870
## Max. :14667 Max. :10947555
## NA's :363 NA's :321
## koi_insol_err1 koi_insol_err2 koi_model_snr koi_tce_plnt_num
## Min. : 0 Min. :-5600031 Min. : 0.0 Min. :1.000
## 1st Qu.: 9 1st Qu.: -287 1st Qu.: 12.0 1st Qu.:1.000
## Median : 73 Median : -40 Median : 23.0 Median :1.000
## Mean : 3751 Mean : -4044 Mean : 259.9 Mean :1.244
## 3rd Qu.: 519 3rd Qu.: -5 3rd Qu.: 78.0 3rd Qu.:1.000
## Max. :3617133 Max. : 0 Max. :9054.7 Max. :8.000
## NA's :321 NA's :321 NA's :363 NA's :346
## koi_tce_delivname koi_steff koi_steff_err1 koi_steff_err2
## q1_q16_tce : 796 Min. : 2661 Min. : 0.0 Min. :-1762.0
## q1_q17_dr24_tce: 368 1st Qu.: 5310 1st Qu.:106.0 1st Qu.: -198.0
## q1_q17_dr25_tce:8054 Median : 5767 Median :157.0 Median : -160.0
## NA's : 346 Mean : 5707 Mean :144.6 Mean : -162.3
## 3rd Qu.: 6112 3rd Qu.:174.0 3rd Qu.: -114.0
## Max. :15896 Max. :676.0 Max. : 0.0
## NA's :363 NA's :468 NA's :483
## koi_slogg koi_slogg_err1 koi_slogg_err2 koi_srad
## Min. :0.047 Min. :0.0000 Min. :-1.2070 Min. : 0.109
## 1st Qu.:4.218 1st Qu.:0.0420 1st Qu.:-0.1960 1st Qu.: 0.829
## Median :4.438 Median :0.0700 Median :-0.1280 Median : 1.000
## Mean :4.310 Mean :0.1207 Mean :-0.1432 Mean : 1.729
## 3rd Qu.:4.543 3rd Qu.:0.1490 3rd Qu.:-0.0880 3rd Qu.: 1.345
## Max. :5.364 Max. :1.4720 Max. : 0.0000 Max. :229.908
## NA's :363 NA's :468 NA's :468 NA's :363
## koi_srad_err1 koi_srad_err2 ra dec
## Min. : 0.0000 Min. :-116.1370 Min. :279.9 Min. :36.58
## 1st Qu.: 0.1290 1st Qu.: -0.2500 1st Qu.:288.7 1st Qu.:40.78
## Median : 0.2510 Median : -0.1110 Median :292.3 Median :43.68
## Mean : 0.3623 Mean : -0.3948 Mean :292.1 Mean :43.81
## 3rd Qu.: 0.3640 3rd Qu.: -0.0690 3rd Qu.:295.9 3rd Qu.:46.71
## Max. :33.0910 Max. : 0.0000 Max. :301.7 Max. :52.34
## NA's :468 NA's :468
## koi_kepmag
## Min. : 6.966
## 1st Qu.:13.440
## Median :14.520
## Mean :14.265
## 3rd Qu.:15.322
## Max. :20.003
## NA's :1
missmap(kepler_df)
Let’s get rid of the error measurements that are fully blank. Let’s also get rid of some obviously useless features:
kepler_df$koi_teq_err1 <- NULL
kepler_df$koi_teq_err2 <- NULL
kepler_df$rowid <- NULL
kepler_df$kepid <- NULL
We should note that koi_teq_err1 and koi_teq_err2, which were deleted, quantify the error margin for the equilibrium temperature of the planets (koi_teq).
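Fully blank columns such as these can also be found programmatically rather than by reading the missingness map. The sketch below works on a small made-up data frame (values are illustrative) and uses colSums(is.na(...)) to find columns where every entry is missing.

```r
# Toy data frame with two fully empty columns (illustrative values only).
toy_df <- data.frame(
  koi_teq = c(793, 443, NA),
  koi_teq_err1 = c(NA, NA, NA),
  koi_teq_err2 = c(NA, NA, NA),
  koi_insol = c(93.59, 9.11, 39.30)
)

# Number of missing values per column
na_counts <- colSums(is.na(toy_df))

# Columns where every row is missing
empty_cols <- names(na_counts)[na_counts == nrow(toy_df)]

# Keep only the columns that hold at least one value
toy_df <- toy_df[, !(names(toy_df) %in% empty_cols), drop = FALSE]

print(empty_cols)
print(names(toy_df))
```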
There are still a fair number of “NA” values remaining, but the data frame is workable from here.
missmap(kepler_df)
Let’s generate some summary stats:
summary(kepler_df)
## kepoi_name kepler_name koi_disposition
## K00001.01: 1 Kepler-1 b : 1 CANDIDATE :2248
## K00002.01: 1 Kepler-10 b : 1 CONFIRMED :2293
## K00003.01: 1 Kepler-10 c : 1 FALSE POSITIVE:5023
## K00004.01: 1 Kepler-100 b: 1
## K00005.01: 1 Kepler-100 c: 1
## K00005.02: 1 (Other) :2289
## (Other) :9558 NA's :7270
## koi_pdisposition koi_score koi_fpflag_nt koi_fpflag_ss
## CANDIDATE :4496 Min. :0.0000 Min. :0.0000 Min. :0.0000
## FALSE POSITIVE:5068 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.3340 Median :0.0000 Median :0.0000
## Mean :0.4808 Mean :0.1882 Mean :0.2316
## 3rd Qu.:0.9980 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000
## NA's :1510
## koi_fpflag_co koi_fpflag_ec koi_period koi_period_err1
## Min. :0.0000 Min. :0.00 Min. : 0.24 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.00 1st Qu.: 2.73 1st Qu.:0.0000
## Median :0.0000 Median :0.00 Median : 9.75 Median :0.0000
## Mean :0.1949 Mean :0.12 Mean : 75.67 Mean :0.0021
## 3rd Qu.:0.0000 3rd Qu.:0.00 3rd Qu.: 40.72 3rd Qu.:0.0003
## Max. :1.0000 Max. :1.00 Max. :129995.78 Max. :0.1725
## NA's :454
## koi_period_err2 koi_time0bk koi_time0bk_err1 koi_time0bk_err2
## Min. :-0.1725 Min. : 120.5 Min. :0.0000 Min. :-0.5690
## 1st Qu.:-0.0003 1st Qu.: 132.8 1st Qu.:0.0012 1st Qu.:-0.0105
## Median : 0.0000 Median : 137.2 Median :0.0041 Median :-0.0041
## Mean :-0.0021 Mean : 166.2 Mean :0.0099 Mean :-0.0099
## 3rd Qu.: 0.0000 3rd Qu.: 170.7 3rd Qu.:0.0105 3rd Qu.:-0.0012
## Max. : 0.0000 Max. :1472.5 Max. :0.5690 Max. : 0.0000
## NA's :454 NA's :454 NA's :454
## koi_impact koi_impact_err1 koi_impact_err2 koi_duration
## Min. : 0.0000 Min. : 0.000 Min. :-59.3200 Min. : 0.052
## 1st Qu.: 0.1970 1st Qu.: 0.040 1st Qu.: -0.4450 1st Qu.: 2.438
## Median : 0.5370 Median : 0.193 Median : -0.2070 Median : 3.793
## Mean : 0.7351 Mean : 1.960 Mean : -0.3326 Mean : 5.622
## 3rd Qu.: 0.8890 3rd Qu.: 0.378 3rd Qu.: -0.0460 3rd Qu.: 6.277
## Max. :100.8060 Max. :85.540 Max. : 0.0000 Max. :138.540
## NA's :363 NA's :454 NA's :454
## koi_duration_err1 koi_duration_err2 koi_depth
## Min. : 0.0000 Min. :-20.2000 Min. : 0.0
## 1st Qu.: 0.0508 1st Qu.: -0.3500 1st Qu.: 159.9
## Median : 0.1420 Median : -0.1420 Median : 421.1
## Mean : 0.3399 Mean : -0.3399 Mean : 23791.3
## 3rd Qu.: 0.3500 3rd Qu.: -0.0508 3rd Qu.: 1473.4
## Max. :20.2000 Max. : 0.0000 Max. :1541400.0
## NA's :454 NA's :454 NA's :363
## koi_depth_err1 koi_depth_err2 koi_prad
## Min. : 0.0 Min. :-388600.0 Min. : 0.08
## 1st Qu.: 9.6 1st Qu.: -49.5 1st Qu.: 1.40
## Median : 20.8 Median : -20.8 Median : 2.39
## Mean : 123.2 Mean : -123.2 Mean : 102.89
## 3rd Qu.: 49.5 3rd Qu.: -9.6 3rd Qu.: 14.93
## Max. :388600.0 Max. : 0.0 Max. :200346.00
## NA's :454 NA's :454 NA's :363
## koi_prad_err1 koi_prad_err2 koi_teq koi_insol
## Min. : 0.00 Min. :-77180.00 Min. : 25 Min. : 0
## 1st Qu.: 0.23 1st Qu.: -1.94 1st Qu.: 539 1st Qu.: 20
## Median : 0.52 Median : -0.30 Median : 878 Median : 142
## Mean : 17.66 Mean : -33.02 Mean : 1085 Mean : 7746
## 3rd Qu.: 2.32 3rd Qu.: -0.14 3rd Qu.: 1379 3rd Qu.: 870
## Max. :21640.00 Max. : 0.00 Max. :14667 Max. :10947555
## NA's :363 NA's :363 NA's :363 NA's :321
## koi_insol_err1 koi_insol_err2 koi_model_snr koi_tce_plnt_num
## Min. : 0 Min. :-5600031 Min. : 0.0 Min. :1.000
## 1st Qu.: 9 1st Qu.: -287 1st Qu.: 12.0 1st Qu.:1.000
## Median : 73 Median : -40 Median : 23.0 Median :1.000
## Mean : 3751 Mean : -4044 Mean : 259.9 Mean :1.244
## 3rd Qu.: 519 3rd Qu.: -5 3rd Qu.: 78.0 3rd Qu.:1.000
## Max. :3617133 Max. : 0 Max. :9054.7 Max. :8.000
## NA's :321 NA's :321 NA's :363 NA's :346
## koi_tce_delivname koi_steff koi_steff_err1 koi_steff_err2
## q1_q16_tce : 796 Min. : 2661 Min. : 0.0 Min. :-1762.0
## q1_q17_dr24_tce: 368 1st Qu.: 5310 1st Qu.:106.0 1st Qu.: -198.0
## q1_q17_dr25_tce:8054 Median : 5767 Median :157.0 Median : -160.0
## NA's : 346 Mean : 5707 Mean :144.6 Mean : -162.3
## 3rd Qu.: 6112 3rd Qu.:174.0 3rd Qu.: -114.0
## Max. :15896 Max. :676.0 Max. : 0.0
## NA's :363 NA's :468 NA's :483
## koi_slogg koi_slogg_err1 koi_slogg_err2 koi_srad
## Min. :0.047 Min. :0.0000 Min. :-1.2070 Min. : 0.109
## 1st Qu.:4.218 1st Qu.:0.0420 1st Qu.:-0.1960 1st Qu.: 0.829
## Median :4.438 Median :0.0700 Median :-0.1280 Median : 1.000
## Mean :4.310 Mean :0.1207 Mean :-0.1432 Mean : 1.729
## 3rd Qu.:4.543 3rd Qu.:0.1490 3rd Qu.:-0.0880 3rd Qu.: 1.345
## Max. :5.364 Max. :1.4720 Max. : 0.0000 Max. :229.908
## NA's :363 NA's :468 NA's :468 NA's :363
## koi_srad_err1 koi_srad_err2 ra dec
## Min. : 0.0000 Min. :-116.1370 Min. :279.9 Min. :36.58
## 1st Qu.: 0.1290 1st Qu.: -0.2500 1st Qu.:288.7 1st Qu.:40.78
## Median : 0.2510 Median : -0.1110 Median :292.3 Median :43.68
## Mean : 0.3623 Mean : -0.3948 Mean :292.1 Mean :43.81
## 3rd Qu.: 0.3640 3rd Qu.: -0.0690 3rd Qu.:295.9 3rd Qu.:46.71
## Max. :33.0910 Max. : 0.0000 Max. :301.7 Max. :52.34
## NA's :468 NA's :468
## koi_kepmag
## Min. : 6.966
## 1st Qu.:13.440
## Median :14.520
## Mean :14.265
## 3rd Qu.:15.322
## Max. :20.003
## NA's :1
Summary information on the structure of the dataframe:
str(kepler_df)
## 'data.frame': 9564 obs. of 46 variables:
## $ kepoi_name : Factor w/ 9564 levels "K00001.01","K00002.01",..: 1081 1082 1083 1084 1085 1086 1087 1088 108 1089 ...
## $ kepler_name : Factor w/ 2294 levels "Kepler-1 b","Kepler-10 b",..: 1036 1037 NA NA 1868 1040 1039 1038 NA 1042 ...
## $ koi_disposition : Factor w/ 3 levels "CANDIDATE","CONFIRMED",..: 2 2 3 3 2 2 2 2 3 2 ...
## $ koi_pdisposition : Factor w/ 2 levels "CANDIDATE","FALSE POSITIVE": 1 1 2 2 1 1 1 1 2 1 ...
## $ koi_score : num 1 0.969 0 0 1 1 1 0.992 0 1 ...
## $ koi_fpflag_nt : int 0 0 0 0 0 0 0 0 0 0 ...
## $ koi_fpflag_ss : int 0 0 1 1 0 0 0 0 1 0 ...
## $ koi_fpflag_co : int 0 0 0 0 0 0 0 0 1 0 ...
## $ koi_fpflag_ec : int 0 0 0 0 0 0 0 0 0 0 ...
## $ koi_period : num 9.49 54.42 19.9 1.74 2.53 ...
## $ koi_period_err1 : num 2.78e-05 2.48e-04 1.49e-05 2.63e-07 3.76e-06 ...
## $ koi_period_err2 : num -2.78e-05 -2.48e-04 -1.49e-05 -2.63e-07 -3.76e-06 ...
## $ koi_time0bk : num 171 163 176 170 172 ...
## $ koi_time0bk_err1 : num 0.00216 0.00352 0.000581 0.000115 0.00113 0.00141 0.0019 0.00461 0.00253 0.000517 ...
## $ koi_time0bk_err2 : num -0.00216 -0.00352 -0.000581 -0.000115 -0.00113 -0.00141 -0.0019 -0.00461 -0.00253 -0.000517 ...
## $ koi_impact : num 0.146 0.586 0.969 1.276 0.701 ...
## $ koi_impact_err1 : num 0.318 0.059 5.126 0.115 0.235 ...
## $ koi_impact_err2 : num -0.146 -0.443 -0.077 -0.092 -0.478 -0.428 -0.532 -0.523 -0.044 -0.052 ...
## $ koi_duration : num 2.96 4.51 1.78 2.41 1.65 ...
## $ koi_duration_err1: num 0.0819 0.116 0.0341 0.00537 0.042 0.061 0.0673 0.165 0.136 0.0241 ...
## $ koi_duration_err2: num -0.0819 -0.116 -0.0341 -0.00537 -0.042 -0.061 -0.0673 -0.165 -0.136 -0.0241 ...
## $ koi_depth : num 616 875 10829 8079 603 ...
## $ koi_depth_err1 : num 19.5 35.5 171 12.8 16.9 24.2 18.7 16.8 5.8 33.3 ...
## $ koi_depth_err2 : num -19.5 -35.5 -171 -12.8 -16.9 -24.2 -18.7 -16.8 -5.8 -33.3 ...
## $ koi_prad : num 2.26 2.83 14.6 33.46 2.75 ...
## $ koi_prad_err1 : num 0.26 0.32 3.92 8.5 0.88 1.27 0.9 0.52 6.45 0.22 ...
## $ koi_prad_err2 : num -0.15 -0.19 -1.31 -2.83 -0.35 -0.42 -0.3 -0.17 -9.67 -0.49 ...
## $ koi_teq : num 793 443 638 1395 1406 ...
## $ koi_insol : num 93.59 9.11 39.3 891.96 926.16 ...
## $ koi_insol_err1 : num 29.45 2.87 31.04 668.95 874.33 ...
## $ koi_insol_err2 : num -16.65 -1.62 -10.49 -230.35 -314.24 ...
## $ koi_model_snr : num 35.8 25.8 76.3 505.6 40.9 ...
## $ koi_tce_plnt_num : int 1 2 1 1 1 1 2 3 1 1 ...
## $ koi_tce_delivname: Factor w/ 3 levels "q1_q16_tce","q1_q17_dr24_tce",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ koi_steff : num 5455 5455 5853 5805 6031 ...
## $ koi_steff_err1 : num 81 81 158 157 169 189 189 189 111 75 ...
## $ koi_steff_err2 : num -81 -81 -176 -174 -211 -232 -232 -232 -124 -83 ...
## $ koi_slogg : num 4.47 4.47 4.54 4.56 4.44 ...
## $ koi_slogg_err1 : num 0.064 0.064 0.044 0.053 0.07 0.054 0.054 0.054 0.182 0.083 ...
## $ koi_slogg_err2 : num -0.096 -0.096 -0.176 -0.168 -0.21 -0.229 -0.229 -0.229 -0.098 -0.028 ...
## $ koi_srad : num 0.927 0.927 0.868 0.791 1.046 ...
## $ koi_srad_err1 : num 0.105 0.105 0.233 0.201 0.334 0.315 0.315 0.315 0.322 0.033 ...
## $ koi_srad_err2 : num -0.061 -0.061 -0.078 -0.067 -0.133 -0.105 -0.105 -0.105 -0.483 -0.072 ...
## $ ra : num 292 292 297 286 289 ...
## $ dec : num 48.1 48.1 48.1 48.3 48.2 ...
## $ koi_kepmag : num 15.3 15.3 15.4 15.6 15.5 ...
We see there are many numeric values, some of which pertain to esoteric astronomical measures. Part of the ensuing investigation into the data will be to test our understanding of what certain features mean.
# First extract only those rows where the planet is named.
named_planets_df <- subset(kepler_df, !is.na(kepler_name))
missmap(named_planets_df)
summary(named_planets_df$koi_teq)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 129.0 554.0 781.0 839.1 1039.0 3559.0 1
summary(named_planets_df$koi_prad)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.270 1.530 2.170 2.879 2.940 77.760 1
There is a “false positive flag” associated with the likely presence of a binary star (i.e. it is set to ‘1’ if the observed light curve is likely due to a binary star). We want to use these flags to determine what proportion of candidate planets are found around (probable) binary stars. We also want to compare what the literature says regarding the planetary hosting ability of binary stars to what the Kepler analysis suggests.
# Let's create labels for the binary star false positive flag
kepler_df$koi_fpflag_ss <- ifelse(kepler_df$koi_fpflag_ss == 0,
                                  "No binary star detected",
                                  "Probable binary star")
# Plot the dispositions according to Kepler data analysis
plot1 <- ggplot(kepler_df, aes(x = koi_fpflag_ss, fill = koi_pdisposition)) + geom_bar(position = "fill")
plot2 <- ggplot(kepler_df, aes(x = koi_fpflag_ss, fill = koi_disposition)) + geom_bar(position = "fill")
plot1
plot2
The above plots show that binary stars have a much smaller proportion of likely planets encircling them than do single stars, and this is seen in both the Kepler analysis labels and literature labels. But this may not reflect the actual capability of binary stars to host planets. These plots could just be a reflection of how difficult it is to detect planets encircling binary star systems.
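The proportions drawn by stacked bar charts with position = "fill" can also be computed as numbers. The sketch below uses a small made-up data frame (the rows and the binary_label column name are illustrative) and reports, for each flag value, the share of each disposition.

```r
# Toy data standing in for the flag and disposition columns (illustrative values only).
toy <- data.frame(
  koi_fpflag_ss = c(0, 0, 0, 0, 1, 1, 1, 1),
  koi_pdisposition = c("CANDIDATE", "CANDIDATE", "CANDIDATE", "FALSE POSITIVE",
                       "FALSE POSITIVE", "FALSE POSITIVE", "FALSE POSITIVE", "CANDIDATE")
)

# Readable labels for the flag
toy$binary_label <- ifelse(toy$koi_fpflag_ss == 0,
                           "No binary star detected",
                           "Probable binary star")

# Counts per (flag, disposition) pair, then the share within each flag value
counts <- table(toy$binary_label, toy$koi_pdisposition)
shares <- prop.table(counts, margin = 1)

print(round(shares, 2))
```

Each row of shares sums to 1, which is exactly what the height of each filled bar represents.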
Using the data available, we can filter planets according to their likeness to Earth and treat these planets as likely being “habitable”. We should note, however, that the filtering is very crude. The only two criteria we can rely on to filter planets are equilibrium temperature (koi_teq) and radius (koi_prad). These two criteria alone are not enough to determine the habitability of a planet. Other data, regarding planet composition for example, are needed to make a definitive judgement on habitability. But we’ll proceed with the crude method for the purpose of this analysis.
We’ll take the habitable planets to be approximately Earth size and within a temperature range that supports liquid water on the surface. Also, we’ll only look at planets with decent koi_score values (where koi_score is a measure of how certain scientists are that the corresponding observation is a planet).
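The criteria just described can be wrapped in a small reusable filter so that the thresholds are easy to vary. The sketch below is one way to do this. The function name and argument names are hypothetical, the default thresholds mirror the values used in this report, and the toy data frame is made up for illustration.

```r
# Hypothetical helper: keeps rows that pass the crude "habitable" criteria.
filter_habitable <- function(df, min_radius = 0.5, max_radius = 2.0,
                             min_temp = 273, max_temp = 373, min_score = 0.4) {
  keep <- df$koi_prad >= min_radius & df$koi_prad <= max_radius &
    df$koi_teq >= min_temp & df$koi_teq <= max_temp &
    df$koi_pdisposition == "CANDIDATE" &
    df$koi_score >= min_score
  # which() drops rows where the condition is NA because of missing values
  df[which(keep), , drop = FALSE]
}

# Toy data (illustrative values only); the last row has a missing radius.
toy <- data.frame(
  koi_prad = c(1.99, 14.60, 1.30, NA),
  koi_teq = c(332, 638, 328, 300),
  koi_pdisposition = c("CANDIDATE", "FALSE POSITIVE", "CANDIDATE", "CANDIDATE"),
  koi_score = c(1.000, 0.000, 0.986, 0.900)
)

habitable <- filter_habitable(toy)
print(habitable)
```

Rows with a missing radius, temperature, or score are excluded, which matches how subset() treats NA conditions.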
habit_df <- subset(kepler_df, koi_prad >= 0.5 & koi_prad <= 2.0 & koi_teq >= 273 & koi_teq <= 373 & koi_pdisposition == "CANDIDATE" & koi_score >= 0.4)
str(habit_df)
## 'data.frame': 52 obs. of 46 variables:
## $ kepoi_name : Factor w/ 9564 levels "K00001.01","K00002.01",..: 1112 1160 1161 1167 1271 1291 1298 1419 1536 1369 ...
## $ kepler_name : Factor w/ 2294 levels "Kepler-1 b","Kepler-10 b",..: 1674 1059 1060 1062 1703 1100 1716 1137 1154 1951 ...
## $ koi_disposition : Factor w/ 3 levels "CANDIDATE","CONFIRMED",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ koi_pdisposition : Factor w/ 2 levels "CANDIDATE","FALSE POSITIVE": 1 1 1 1 1 1 1 1 1 1 ...
## $ koi_score : num 1 1 1 1 0.986 0.998 1 0.881 0.992 1 ...
## $ koi_fpflag_nt : int 0 0 0 0 0 0 0 0 0 0 ...
## $ koi_fpflag_ss : chr "No binary star detected" "No binary star detected" "No binary star detected" "No binary star detected" ...
## $ koi_fpflag_co : int 0 0 0 0 0 0 0 0 0 0 ...
## $ koi_fpflag_ec : int 0 0 0 0 0 0 0 0 0 0 ...
## $ koi_period : num 36.4 20.1 46.2 24 21 ...
## $ koi_period_err1 : num 1.81e-04 5.84e-05 2.65e-04 9.25e-05 7.66e-05 ...
## $ koi_period_err2 : num -1.81e-04 -5.84e-05 -2.65e-04 -9.25e-05 -7.66e-05 ...
## $ koi_time0bk : num 152 147 165 186 152 ...
## $ koi_time0bk_err1 : num 0.00392 0.00241 0.0043 0.00305 0.00305 0.00229 0.00485 0.00932 0.00254 0.00157 ...
## $ koi_time0bk_err2 : num -0.00392 -0.00241 -0.0043 -0.00305 -0.00305 -0.00229 -0.00485 -0.00932 -0.00254 -0.00157 ...
## $ koi_impact : num 0.028 0.556 0.013 0.416 0.228 0.115 0.045 0.015 0.035 0.009 ...
## $ koi_impact_err1 : num 0.437 0.323 0.415 0.053 0.176 0.32 0.393 0.464 0.395 0.374 ...
## $ koi_impact_err2 : num -0.028 -0.368 -0.013 -0.416 -0.228 -0.115 -0.045 -0.015 -0.035 -0.009 ...
## $ koi_duration : num 4.01 3.32 4.76 3.83 2.63 ...
## $ koi_duration_err1: num 0.126 0.0754 0.13 0.113 0.0866 0.069 0.155 0.331 0.0806 0.0556 ...
## $ koi_duration_err2: num -0.126 -0.0754 -0.13 -0.113 -0.0866 -0.069 -0.155 -0.331 -0.0806 -0.0556 ...
## $ koi_depth : num 1122 1495 1395 1182 767 ...
## $ koi_depth_err1 : num 53.4 45.1 56.4 46.1 42.9 26.4 43.6 27.7 60.3 33.9 ...
## $ koi_depth_err2 : num -53.4 -45.1 -56.4 -46.1 -42.9 -26.4 -43.6 -27.7 -60.3 -33.9 ...
## $ koi_prad : num 1.99 1.96 1.83 1.8 1.3 1.11 1.85 1.75 1.87 1.83 ...
## $ koi_prad_err1 : num 0.09 0.13 0.12 0.1 0.1 0.12 0.13 0.11 0.14 0.16 ...
## $ koi_prad_err2 : num -0.09 -0.16 -0.15 -0.15 -0.15 -0.16 -0.05 -0.13 -0.22 -0.21 ...
## $ koi_teq : num 332 361 273 329 328 332 349 372 301 298 ...
## $ koi_insol : num 2.88 4 1.32 2.77 2.74 2.86 3.49 4.53 1.95 1.87 ...
## $ koi_insol_err1 : num 0.51 0.89 0.29 0.58 0.7 0.95 0.81 1.04 0.49 0.52 ...
## $ koi_insol_err2 : num -0.47 -0.9 -0.3 -0.64 -0.79 -0.98 -0.47 -0.95 -0.55 -0.53 ...
## $ koi_model_snr : num 21.7 35.2 26 27.5 18.2 30.9 21.6 17.4 28.9 53.8 ...
## $ koi_tce_plnt_num : int 3 2 3 1 3 3 5 2 3 1 ...
## $ koi_tce_delivname: Factor w/ 3 levels "q1_q16_tce","q1_q17_dr24_tce",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ koi_steff : num 4126 3950 3950 3747 3713 ...
## $ koi_steff_err1 : num 82 70 70 75 74 71 90 105 75 75 ...
## $ koi_steff_err2 : num -82 -86 -86 -83 -92 -89 -90 -105 -82 -84 ...
## $ koi_slogg : num 4.66 4.75 4.75 4.73 4.78 ...
## $ koi_slogg_err1 : num 0.022 0.042 0.042 0.042 0.063 0.055 0.013 0.072 0.063 0.063 ...
## $ koi_slogg_err2 : num -0.022 -0.031 -0.031 -0.025 -0.031 -0.055 -0.043 -0.048 -0.031 -0.031 ...
## $ koi_srad : num 0.615 0.493 0.493 0.524 0.47 0.411 0.646 0.849 0.46 0.461 ...
## $ koi_srad_err1 : num 0.027 0.033 0.033 0.03 0.036 0.045 0.043 0.056 0.035 0.04 ...
## $ koi_srad_err2 : num -0.029 -0.04 -0.04 -0.044 -0.054 -0.06 -0.02 -0.063 -0.053 -0.053 ...
## $ ra : num 287 286 286 284 295 ...
## $ dec : num 50 39.3 39.3 39.9 43.1 ...
## $ koi_kepmag : num 15.1 16 16 15.4 15.8 ...
summary(habit_df)
## kepoi_name kepler_name koi_disposition
## K00172.02: 1 Kepler-1185 b: 1 CANDIDATE :23
## K00238.03: 1 Kepler-138 d : 1 CONFIRMED :29
## K00248.04: 1 Kepler-1450 b: 1 FALSE POSITIVE: 0
## K00253.02: 1 Kepler-1459 b: 1
## K00314.02: 1 Kepler-1512 b: 1
## K00494.01: 1 (Other) :24
## (Other) :46 NA's :23
## koi_pdisposition koi_score koi_fpflag_nt koi_fpflag_ss
## CANDIDATE :52 Min. :0.5230 Min. :0 Length:52
## FALSE POSITIVE: 0 1st Qu.:0.9223 1st Qu.:0 Class :character
## Median :0.9920 Median :0 Mode :character
## Mean :0.9308 Mean :0
## 3rd Qu.:1.0000 3rd Qu.:0
## Max. :1.0000 Max. :0
##
## koi_fpflag_co koi_fpflag_ec koi_period koi_period_err1
## Min. :0 Min. :0 Min. : 4.486 Min. :0.0000077
## 1st Qu.:0 1st Qu.:0 1st Qu.: 20.901 1st Qu.:0.0000753
## Median :0 Median :0 Median : 37.498 Median :0.0002705
## Mean :0 Mean :0 Mean : 63.967 Mean :0.0010320
## 3rd Qu.:0 3rd Qu.:0 3rd Qu.: 77.912 3rd Qu.:0.0009774
## Max. :0 Max. :0 Max. :362.978 Max. :0.0159900
## NA's :1
## koi_period_err2 koi_time0bk koi_time0bk_err1
## Min. :-0.0159900 Min. :131.1 Min. :0.000811
## 1st Qu.:-0.0009774 1st Qu.:139.9 1st Qu.:0.002710
## Median :-0.0002705 Median :151.2 Median :0.005190
## Mean :-0.0010320 Mean :160.7 Mean :0.008115
## 3rd Qu.:-0.0000753 3rd Qu.:166.1 3rd Qu.:0.009940
## Max. :-0.0000077 Max. :280.1 Max. :0.051800
## NA's :1 NA's :1
## koi_time0bk_err2 koi_impact koi_impact_err1 koi_impact_err2
## Min. :-0.051800 Min. :0.00400 Min. :0.0000 Min. :-0.7030
## 1st Qu.:-0.009940 1st Qu.:0.04575 1st Qu.:0.0740 1st Qu.:-0.4360
## Median :-0.005190 Median :0.22250 Median :0.3140 Median :-0.2170
## Mean :-0.008115 Mean :0.32267 Mean :0.2604 Mean :-0.2559
## 3rd Qu.:-0.002710 3rd Qu.:0.53400 3rd Qu.:0.4120 3rd Qu.:-0.0455
## Max. :-0.000811 Max. :0.95400 Max. :0.5350 Max. :-0.0040
## NA's :1 NA's :1 NA's :1
## koi_duration koi_duration_err1 koi_duration_err2 koi_depth
## Min. : 0.8161 Min. :0.0301 Min. :-1.4000 Min. : 243.8
## 1st Qu.: 2.4299 1st Qu.:0.0836 1st Qu.:-0.3035 1st Qu.: 374.8
## Median : 3.7890 Median :0.1530 Median :-0.1530 Median : 606.9
## Mean : 4.4728 Mean :0.2433 Mean :-0.2433 Mean : 843.8
## 3rd Qu.: 5.0238 3rd Qu.:0.3035 3rd Qu.:-0.0836 3rd Qu.: 922.5
## Max. :16.0300 Max. :1.4000 Max. :-0.0301 Max. :6462.0
## NA's :1 NA's :1
## koi_depth_err1 koi_depth_err2 koi_prad koi_prad_err1
## Min. : 8.90 Min. :-203.00 Min. :0.790 Min. :0.0500
## 1st Qu.: 26.95 1st Qu.: -54.35 1st Qu.:1.270 1st Qu.:0.0975
## Median : 37.40 Median : -37.40 Median :1.565 Median :0.1200
## Mean : 47.65 Mean : -47.65 Mean :1.529 Mean :0.1542
## 3rd Qu.: 54.35 3rd Qu.: -26.95 3rd Qu.:1.830 3rd Qu.:0.1650
## Max. :203.00 Max. : -8.90 Max. :1.990 Max. :0.7000
## NA's :1 NA's :1
## koi_prad_err2 koi_teq koi_insol koi_insol_err1
## Min. :-0.3300 Min. :273.0 Min. :1.320 Min. :0.2900
## 1st Qu.:-0.1725 1st Qu.:304.5 1st Qu.:2.045 1st Qu.:0.5275
## Median :-0.1450 Median :329.5 Median :2.780 Median :0.7000
## Mean :-0.1458 Mean :327.0 Mean :2.812 Mean :0.9652
## 3rd Qu.:-0.1175 3rd Qu.:349.0 3rd Qu.:3.495 3rd Qu.:1.0500
## Max. :-0.0500 Max. :373.0 Max. :4.590 Max. :4.1600
##
## koi_insol_err2 koi_model_snr koi_tce_plnt_num
## Min. :-1.6700 Min. : 5.10 Min. :1.000
## 1st Qu.:-0.9600 1st Qu.:12.97 1st Qu.:1.000
## Median :-0.6300 Median :16.55 Median :1.000
## Mean :-0.7208 Mean :21.42 Mean :1.635
## 3rd Qu.:-0.4775 3rd Qu.:27.20 3rd Qu.:2.000
## Max. :-0.3000 Max. :57.40 Max. :5.000
##
## koi_tce_delivname koi_steff koi_steff_err1 koi_steff_err2
## q1_q16_tce : 0 Min. :3157 Min. : 41.00 Min. :-219.0
## q1_q17_dr24_tce: 0 1st Qu.:3750 1st Qu.: 74.00 1st Qu.:-129.2
## q1_q17_dr25_tce:52 Median :4129 Median : 82.50 Median : -88.5
## Mean :4367 Mean : 97.94 Mean :-103.9
## 3rd Qu.:4899 3rd Qu.:115.50 3rd Qu.: -82.0
## Max. :6086 Max. :219.00 Max. : -25.0
##
## koi_slogg koi_slogg_err1 koi_slogg_err2 koi_srad
## Min. :4.274 Min. :0.01000 Min. :-0.20400 Min. :0.1800
## 1st Qu.:4.560 1st Qu.:0.03600 1st Qu.:-0.05525 1st Qu.:0.4753
## Median :4.691 Median :0.05300 Median :-0.03550 Median :0.5580
## Mean :4.684 Mean :0.05679 Mean :-0.05183 Mean :0.6104
## 3rd Qu.:4.776 3rd Qu.:0.06750 3rd Qu.:-0.03075 3rd Qu.:0.7750
## Max. :5.112 Max. :0.13700 Max. :-0.01000 Max. :1.2160
##
## koi_srad_err1 koi_srad_err2 ra dec
## Min. :0.02200 Min. :-0.17900 Min. :281.6 Min. :37.36
## 1st Qu.:0.03300 1st Qu.:-0.06500 1st Qu.:286.7 1st Qu.:41.02
## Median :0.04350 Median :-0.05000 Median :290.8 Median :43.99
## Mean :0.06133 Mean :-0.05729 Mean :290.7 Mean :43.76
## 3rd Qu.:0.05875 3rd Qu.:-0.03875 3rd Qu.:294.6 3rd Qu.:46.13
## Max. :0.23800 Max. :-0.02000 Max. :299.8 Max. :50.70
##
## koi_kepmag
## Min. :12.57
## 1st Qu.:14.40
## Median :15.21
## Mean :15.00
## 3rd Qu.:15.73
## Max. :17.48
##
head(habit_df)
## kepoi_name kepler_name koi_disposition koi_pdisposition koi_score
## 57 K00775.03 Kepler-52 d CONFIRMED CANDIDATE 1.000
## 86 K00812.02 Kepler-235 d CONFIRMED CANDIDATE 1.000
## 87 K00812.03 Kepler-235 e CONFIRMED CANDIDATE 1.000
## 115 K00817.01 Kepler-236 c CONFIRMED CANDIDATE 1.000
## 223 K00886.03 Kepler-54 d CONFIRMED CANDIDATE 0.986
## 246 K00899.03 Kepler-249 d CONFIRMED CANDIDATE 0.998
## koi_fpflag_nt koi_fpflag_ss koi_fpflag_co koi_fpflag_ec
## 57 0 No binary star detected 0 0
## 86 0 No binary star detected 0 0
## 87 0 No binary star detected 0 0
## 115 0 No binary star detected 0 0
## 223 0 No binary star detected 0 0
## 246 0 No binary star detected 0 0
## koi_period koi_period_err1 koi_period_err2 koi_time0bk
## 57 36.44540 1.809e-04 -1.809e-04 151.6012
## 86 20.06036 5.839e-05 -5.839e-05 147.4655
## 87 46.18420 2.654e-04 -2.654e-04 165.2373
## 115 23.96794 9.249e-05 -9.249e-05 186.2218
## 223 20.99588 7.662e-05 -7.662e-05 152.3155
## 246 15.36846 4.067e-05 -4.067e-05 147.3896
## koi_time0bk_err1 koi_time0bk_err2 koi_impact koi_impact_err1
## 57 0.00392 -0.00392 0.028 0.437
## 86 0.00241 -0.00241 0.556 0.323
## 87 0.00430 -0.00430 0.013 0.415
## 115 0.00305 -0.00305 0.416 0.053
## 223 0.00305 -0.00305 0.228 0.176
## 246 0.00229 -0.00229 0.115 0.320
## koi_impact_err2 koi_duration koi_duration_err1 koi_duration_err2
## 57 -0.028 4.0070 0.1260 -0.1260
## 86 -0.368 3.3203 0.0754 -0.0754
## 87 -0.013 4.7580 0.1300 -0.1300
## 115 -0.416 3.8270 0.1130 -0.1130
## 223 -0.228 2.6333 0.0866 -0.0866
## 246 -0.115 2.4714 0.0690 -0.0690
## koi_depth koi_depth_err1 koi_depth_err2 koi_prad koi_prad_err1
## 57 1122.3 53.4 -53.4 1.99 0.09
## 86 1494.7 45.1 -45.1 1.96 0.13
## 87 1394.7 56.4 -56.4 1.83 0.12
## 115 1182.3 46.1 -46.1 1.80 0.10
## 223 767.3 42.9 -42.9 1.30 0.10
## 246 756.1 26.4 -26.4 1.11 0.12
## koi_prad_err2 koi_teq koi_insol koi_insol_err1 koi_insol_err2
## 57 -0.09 332 2.88 0.51 -0.47
## 86 -0.16 361 4.00 0.89 -0.90
## 87 -0.15 273 1.32 0.29 -0.30
## 115 -0.15 329 2.77 0.58 -0.64
## 223 -0.15 328 2.74 0.70 -0.79
## 246 -0.16 332 2.86 0.95 -0.98
## koi_model_snr koi_tce_plnt_num koi_tce_delivname koi_steff
## 57 21.7 3 q1_q17_dr25_tce 4126
## 86 35.2 2 q1_q17_dr25_tce 3950
## 87 26.0 3 q1_q17_dr25_tce 3950
## 115 27.5 1 q1_q17_dr25_tce 3747
## 223 18.2 3 q1_q17_dr25_tce 3713
## 246 30.9 3 q1_q17_dr25_tce 3561
## koi_steff_err1 koi_steff_err2 koi_slogg koi_slogg_err1 koi_slogg_err2
## 57 82 -82 4.661 0.022 -0.022
## 86 70 -86 4.754 0.042 -0.031
## 87 70 -86 4.754 0.042 -0.031
## 115 75 -83 4.728 0.042 -0.025
## 223 74 -92 4.779 0.063 -0.031
## 246 71 -89 4.855 0.055 -0.055
## koi_srad koi_srad_err1 koi_srad_err2 ra dec koi_kepmag
## 57 0.615 0.027 -0.029 286.7380 49.97575 15.095
## 86 0.493 0.033 -0.040 286.0791 39.27832 15.954
## 87 0.493 0.033 -0.040 286.0791 39.27832 15.954
## 115 0.524 0.030 -0.044 283.8664 39.89808 15.414
## 223 0.470 0.036 -0.054 294.7739 43.05630 15.847
## 246 0.411 0.045 -0.060 296.9851 43.65852 15.234
Now let’s visualize the features of the “habitable” planets.
# Temperature distribution
ggplot(habit_df, aes(x = koi_teq, fill = koi_disposition)) + geom_histogram(binwidth = 9) + xlab("Effective Temperature (Kelvin)") + labs(title = "Temperature Distribution of Likely Habitable Planets\n") + geom_vline(xintercept=252, colour="orange", linetype = "longdash") + annotate("text", x = 267, y = 6, label = "Effective Temp\nof Earth")
Observation: it appears that all the Earth-like planets in the dataset are considerably warmer than the Earth.
# Radius distribution
ggplot(habit_df, aes(x = koi_prad, fill = koi_disposition)) + geom_histogram(binwidth = 0.1) + xlab("Planetary Radius (Earth Radii)") + labs(title = "Planetary Radius Distribution of Likely Habitable Planets\n") + geom_vline(xintercept=1, colour="orange", linetype = "longdash") + annotate("text", x = 1.14, y = 6, label = "Earth Radius")
Observation: most of the Earth-like planets in the dataset are substantially larger than the Earth.
Let’s investigate sky-projected distances.
# koi_impact distribution
ggplot(habit_df, aes(x = koi_impact, fill = koi_disposition)) + geom_histogram(binwidth = 0.1) + xlab("Sky-Projected Distance") + labs(title = "Sky-Projected Distance Distribution of Likely Habitable Planets\n")
Observation: The distribution looked negative exponential, and suggested that Earth-like planets tended to have smaller sky-projected distances. We were not certain whether the negative exponential shape lent itself to any special interpretation, or whether koi_impact measurements related to any sort of Poisson point process. Such an interpretation was especially difficult to make given that we did not really know what sky-projected distance represented.
We did suspect, however, that sky-projected distance was a proxy for the actual distance between a planet and its star. This suspicion arose out of the fact that the “goldilocks” zone for a planet tended to be closer to the star. Therefore we expected habitable planets to be located somewhat closer to stars. A continuation of the investigation into sky-projected distance is detailed below…
But first we finish detailing our analysis of likely habitable planets. Here is a scatterplot of the Earth-like planets.
ggplot(habit_df, aes(x = koi_prad, y = koi_teq, size = koi_score)) + geom_point(aes(color = koi_disposition)) + labs(title = "Plot of Likely Habitable Planets\n") + xlab("Planetary Radius (Earth Radii)") + ylab("Effective Temperature (Kelvin)")
Observation: There’s no information here that was not revealed above. The likely habitable planets in the dataset were typically larger and warmer than the Earth.
Now let’s proceed to better understand sky-projected distance. We had a hypothesis that sky-projected distance was a proxy for actual distance from a star. Luckily, we had planet features that related to the orbital speeds of planets. The laws of physics dictate that planets further out from a star are generally slower moving. So, by plotting sky-projected distance in relation to transit duration, or period between transits, we could see if sky-projected distance increased with larger transit duration.
But first, we examine the distribution of sky-projected distances for all planets.
# Take the subset where sky-projected distance is available, where the koi_score is fairly high, and where there is no FALSE POSITIVE label. The motivation is to remove erroneous data associated with observations that are not likely to be planets.
test_df <- subset(kepler_df, !is.na(koi_impact) & koi_score > 0.5 & koi_disposition != "FALSE POSITIVE" & koi_impact <= 1.0)
ggplot(test_df, aes(x = koi_impact, fill = koi_disposition)) + geom_histogram(binwidth = 0.01) + xlab("Sky-Projected Distance") + labs(title = "Sky-Projected Distance Distribution of Likely Planets\n") + xlim(c(0,1))
summary(test_df$koi_impact)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1030 0.3935 0.4343 0.7480 1.0000
print(mean(test_df$koi_impact))
## [1] 0.434295
print(sd(test_df$koi_impact))
## [1] 0.3308789
As with the likely habitable planets, we see a negative exponential-looking distribution for sky-projected distance. Therefore the hypothesis mentioned above is shown to be false: it’s not just habitable planets that tend to have smaller koi_impact. Planets, in general, are likely to have small koi_impact values. This puts koi_impact into doubt as a predictor/proxy for the actual distance of a planet from its star.
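A more direct handle on orbital distance than koi_impact is the orbital period. As a rough sketch (not part of the original analysis), Kepler's third law gives the semi-major axis from koi_period once the star's mass is known, and that mass can be estimated from koi_slogg and koi_srad via g = GM/R². The function name `semi_major_axis_au` and the constants are illustrative assumptions, and the formula assumes a planet of negligible mass compared to its star.

```r
# Sketch: semi-major axis from orbital period via Kepler's third law.
# Assumes koi_slogg is log10(g) with g in cm/s^2 and koi_srad is in solar radii.
R_SUN <- 6.957e8        # solar radius, m
AU    <- 1.495978707e11 # astronomical unit, m

semi_major_axis_au <- function(period_days, slogg, srad) {
  g_si   <- 10^slogg / 100        # cm/s^2 -> m/s^2
  radius <- srad * R_SUN          # stellar radius in metres
  gm     <- g_si * radius^2       # G * M, from g = G M / R^2
  period <- period_days * 86400   # days -> seconds
  a      <- (gm * period^2 / (4 * pi^2))^(1 / 3)
  a / AU
}

# Sanity check with Sun-like values: a one-year orbit should be close to 1 AU.
earth_a <- semi_major_axis_au(365.25, slogg = 4.438, srad = 1)
print(round(earth_a, 3))
```

With the real data, something like `semi_major_axis_au(test_df$koi_period, test_df$koi_slogg, test_df$koi_srad)` could then be plotted against koi_impact to test the proxy hypothesis directly.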
Then we turn our attention to the stars that host Earth-like planets. How did they compare to our Sun?
We plot the star radii v. their photospheric temperature.
temp_df <- habit_df[,c("koi_steff","koi_slogg","koi_srad")]
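# Sun: temperature (K), log10 of surface gravity in cm/s^2 (the unit used by koi_slogg), radius (solar radii)
sun <- c(5778, 4.43775056282, 1.00)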
temp_df[dim(temp_df)[1]+1,] <- sun
ggplot(temp_df, aes(x = koi_srad, y = koi_steff)) + geom_point(aes(colour = koi_slogg), size = 5) + scale_colour_gradient2(low = "#FF3300", mid = "white", high = "#663300", midpoint = 3.6) + xlab("Photospheric Radius of the Star (Normalized to Sun's Radius)") + ylab("Photospheric Temperature of the Star (Kelvin)") + labs(title = "Stars with Potentially Habitable Planets\n", color = "Base-10 Log\nof Surface Gravity\nAcceleration") + annotate("text", x = 1.06, y = 5778, label = "Sun") + geom_smooth(method='lm',formula=y~x) + annotate("text", x = 0.5, y = 6000, label = paste("R-square =", round(cor(temp_df$koi_srad,temp_df$koi_steff)^2, digits = 4)))
# Print the correlation
print(cor(temp_df$koi_srad,temp_df$koi_steff))
## [1] 0.9548735
Observation: Most of the stars that host Earth-like planets seemed to be smaller and cooler than our Sun, and to have larger surface gravitational accelerations. What was also interesting to see was the fairly high degree of correlation observed between the photospheric radius and temperature of stars hosting Earth-like planets. The correlation table above, built for all candidate planets with a fairly high koi_score, did not suggest this would be the case. But when we look at the subset of stars that host Earth-like planets, the relation is apparent. It’s not clear why this is the case, however.
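The plot annotation above reports the squared correlation as "R-square". For a linear model with a single predictor the two quantities coincide, which the following sketch illustrates on synthetic data (the numbers are made up and are not the Kepler values). Note also that the correlation above is computed on temp_df, which includes the appended Sun row.

```r
# Sketch with synthetic data: for a one-predictor linear model,
# R-squared equals the squared Pearson correlation.
set.seed(1)
srad  <- runif(50, 0.2, 1.2)                       # stand-in for koi_srad
steff <- 3000 + 2500 * srad + rnorm(50, sd = 150)  # stand-in for koi_steff

r        <- cor(srad, steff)
fit      <- lm(steff ~ srad)
r_square <- summary(fit)$r.squared

print(c(correlation = r, r_squared = r_square, cor_squared = r^2))
```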
A Kaggle user used right ascension and declination data, the celestial coordinates of the observations, to see where candidates and confirmed planets were being observed. We wanted to do something similar, but use the resulting plot in a different way: to see if Earth-like planets were restricted to certain patches of the night sky.
Add labels to the overall dataset: “Earth-like” and “Not Earth-like”.
for (i in 1:dim(kepler_df)[1]) {
# Do a check for NAs
na_check <- is.na(kepler_df$koi_prad[i]) | is.na(kepler_df$koi_teq[i]) | is.na(kepler_df$koi_score[i]) | is.na(kepler_df$koi_pdisposition[i])
if (!na_check) {
if (kepler_df$koi_prad[i] >= 0.5 & kepler_df$koi_prad[i] <= 2.0 & kepler_df$koi_teq[i] >= 273 & kepler_df$koi_teq[i] <= 373 & kepler_df$koi_pdisposition[i] == "CANDIDATE" & kepler_df$koi_score[i] >= 0.4) {
kepler_df$koi_els[i] <- "Earth-like"
} else {
kepler_df$koi_els[i] <- "Not Earth-like"
}
}
}
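The same labelling rule can also be written without an explicit loop. The following is a minimal sketch on a toy data frame (the rows are invented for illustration); treating rows with missing inputs as "Not Earth-like" is an assumption of this sketch.

```r
# Sketch: vectorized Earth-like labelling on a toy data frame.
# Column names follow the dataset; the rows are made up for illustration.
toy_df <- data.frame(
  koi_prad         = c(1.2, 5.0, NA, 1.8),
  koi_teq          = c(300, 310, 290, 500),
  koi_score        = c(0.9, 0.95, 0.8, 0.7),
  koi_pdisposition = c("CANDIDATE", "CANDIDATE", "CANDIDATE", "CANDIDATE"),
  stringsAsFactors = FALSE
)

is_earth_like <- with(toy_df,
  koi_prad >= 0.5 & koi_prad <= 2.0 &
  koi_teq >= 273 & koi_teq <= 373 &
  koi_pdisposition == "CANDIDATE" &
  koi_score >= 0.4
)

# Rows where the test is NA (missing inputs) are treated as "Not Earth-like".
toy_df$koi_els <- ifelse(!is.na(is_earth_like) & is_earth_like,
                         "Earth-like", "Not Earth-like")
print(toy_df$koi_els)
```

Applied to kepler_df, this evaluates all rows at once, which is typically much faster than assigning one element per iteration.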
Now we plot celestial coordinate data overlaid with the new koi_els feature information.
el_df <- subset(kepler_df, koi_els == "Earth-like")
ggplot(kepler_df, aes(x = ra, y = dec)) + geom_point(aes(colour = koi_els), size = 1.5) + xlab("Right Ascension") + ylab("Declination") + labs(title = "Celestial Positioning of Observations\n")
As suspected, the observations corresponding to Earth-like planets are spread out across the patches of celestial coordinates observed by Kepler. Another hypothesis was that perhaps the patches with a higher density of observations would also have more observations of Earth-like planets. This does not seem to be the case, however.
starData <- read.csv("cumulative.csv", header = TRUE)
head(starData)
## rowid kepid kepoi_name kepler_name koi_disposition koi_pdisposition
## 1 1 10797460 K00752.01 Kepler-227 b CONFIRMED CANDIDATE
## 2 2 10797460 K00752.02 Kepler-227 c CONFIRMED CANDIDATE
## 3 3 10811496 K00753.01 FALSE POSITIVE FALSE POSITIVE
## 4 4 10848459 K00754.01 FALSE POSITIVE FALSE POSITIVE
## 5 5 10854555 K00755.01 Kepler-664 b CONFIRMED CANDIDATE
## 6 6 10872983 K00756.01 Kepler-228 d CONFIRMED CANDIDATE
## koi_score koi_fpflag_nt koi_fpflag_ss koi_fpflag_co koi_fpflag_ec
## 1 1.000 0 0 0 0
## 2 0.969 0 0 0 0
## 3 0.000 0 1 0 0
## 4 0.000 0 1 0 0
## 5 1.000 0 0 0 0
## 6 1.000 0 0 0 0
## koi_period koi_period_err1 koi_period_err2 koi_time0bk koi_time0bk_err1
## 1 9.488036 2.775e-05 -2.775e-05 170.5387 0.002160
## 2 54.418383 2.479e-04 -2.479e-04 162.5138 0.003520
## 3 19.899140 1.494e-05 -1.494e-05 175.8503 0.000581
## 4 1.736952 2.630e-07 -2.630e-07 170.3076 0.000115
## 5 2.525592 3.761e-06 -3.761e-06 171.5956 0.001130
## 6 11.094321 2.036e-05 -2.036e-05 171.2012 0.001410
## koi_time0bk_err2 koi_impact koi_impact_err1 koi_impact_err2 koi_duration
## 1 -0.002160 0.146 0.318 -0.146 2.95750
## 2 -0.003520 0.586 0.059 -0.443 4.50700
## 3 -0.000581 0.969 5.126 -0.077 1.78220
## 4 -0.000115 1.276 0.115 -0.092 2.40641
## 5 -0.001130 0.701 0.235 -0.478 1.65450
## 6 -0.001410 0.538 0.030 -0.428 4.59450
## koi_duration_err1 koi_duration_err2 koi_depth koi_depth_err1
## 1 0.08190 -0.08190 615.8 19.5
## 2 0.11600 -0.11600 874.8 35.5
## 3 0.03410 -0.03410 10829.0 171.0
## 4 0.00537 -0.00537 8079.2 12.8
## 5 0.04200 -0.04200 603.3 16.9
## 6 0.06100 -0.06100 1517.5 24.2
## koi_depth_err2 koi_prad koi_prad_err1 koi_prad_err2 koi_teq koi_teq_err1
## 1 -19.5 2.26 0.26 -0.15 793 NA
## 2 -35.5 2.83 0.32 -0.19 443 NA
## 3 -171.0 14.60 3.92 -1.31 638 NA
## 4 -12.8 33.46 8.50 -2.83 1395 NA
## 5 -16.9 2.75 0.88 -0.35 1406 NA
## 6 -24.2 3.90 1.27 -0.42 835 NA
## koi_teq_err2 koi_insol koi_insol_err1 koi_insol_err2 koi_model_snr
## 1 NA 93.59 29.45 -16.65 35.8
## 2 NA 9.11 2.87 -1.62 25.8
## 3 NA 39.30 31.04 -10.49 76.3
## 4 NA 891.96 668.95 -230.35 505.6
## 5 NA 926.16 874.33 -314.24 40.9
## 6 NA 114.81 112.85 -36.70 66.5
## koi_tce_plnt_num koi_tce_delivname koi_steff koi_steff_err1
## 1 1 q1_q17_dr25_tce 5455 81
## 2 2 q1_q17_dr25_tce 5455 81
## 3 1 q1_q17_dr25_tce 5853 158
## 4 1 q1_q17_dr25_tce 5805 157
## 5 1 q1_q17_dr25_tce 6031 169
## 6 1 q1_q17_dr25_tce 6046 189
## koi_steff_err2 koi_slogg koi_slogg_err1 koi_slogg_err2 koi_srad
## 1 -81 4.467 0.064 -0.096 0.927
## 2 -81 4.467 0.064 -0.096 0.927
## 3 -176 4.544 0.044 -0.176 0.868
## 4 -174 4.564 0.053 -0.168 0.791
## 5 -211 4.438 0.070 -0.210 1.046
## 6 -232 4.486 0.054 -0.229 0.972
## koi_srad_err1 koi_srad_err2 ra dec koi_kepmag
## 1 0.105 -0.061 291.9342 48.14165 15.347
## 2 0.105 -0.061 291.9342 48.14165 15.347
## 3 0.233 -0.078 297.0048 48.13413 15.436
## 4 0.201 -0.067 285.5346 48.28521 15.597
## 5 0.334 -0.133 288.7549 48.22620 15.509
## 6 0.315 -0.105 296.2861 48.22467 15.714
Assigning NA to the blank values in the entire dataset
starData[starData ==""] <- NA
List the name and number of the columns that have at least one missing value.
naCol <- which(colMeans(is.na(starData))>0)
naCol
## kepler_name koi_score koi_period_err1 koi_period_err2
## 4 7 13 14
## koi_time0bk_err1 koi_time0bk_err2 koi_impact koi_impact_err1
## 16 17 18 19
## koi_impact_err2 koi_duration_err1 koi_duration_err2 koi_depth
## 20 22 23 24
## koi_depth_err1 koi_depth_err2 koi_prad koi_prad_err1
## 25 26 27 28
## koi_prad_err2 koi_teq koi_teq_err1 koi_teq_err2
## 29 30 31 32
## koi_insol koi_insol_err1 koi_insol_err2 koi_model_snr
## 33 34 35 36
## koi_tce_plnt_num koi_tce_delivname koi_steff koi_steff_err1
## 37 38 39 40
## koi_steff_err2 koi_slogg koi_slogg_err1 koi_slogg_err2
## 41 42 43 44
## koi_srad koi_srad_err1 koi_srad_err2 koi_kepmag
## 45 46 47 50
naVal<- vector()
for (colnum in 1:50) {
naVal[colnum] <- sum(complete.cases(starData[colnum])==FALSE)
}
naVal<-naVal[naVal!=0]
NaData <- data.frame(naCol,naVal)
barplot(NaData$naVal,main = "Missing Value Counts", names.arg = NaData$naCol, cex.names = 0.5,
xlab="column names", col="red")
titleLabels[4]
## [1] "kepler_name"
titleLabels[7]
## [1] "koi_score"
titleLabels[31]
## [1] "koi_teq_err1"
titleLabels[32]
## [1] "koi_teq_err2"
As shown above, the columns/features with the most missing values rank as follows:
1. “koi_teq_err1” and “koi_teq_err2”
2. “kepler_name”
3. “koi_score”
Since missing values occur in many of the columns, it is unreasonable to delete every observation that contains one. Disregarding the four columns with the most missing values, many of the remaining columns have about 10% of their values missing. If we were to delete every row with at least one missing value, we would not be removing 10% of the data; it would be closer to 50%, since the missing values are not all located in the same rows.
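The gap between per-column and per-row missingness can be checked directly. Here is a small sketch on a toy data frame (invented values) showing how a 10% missing rate in each column can add up to a much larger share of incomplete rows.

```r
# Sketch on a toy data frame: per-column missing rates can be small while
# the share of rows with at least one NA is much larger.
toy_df <- data.frame(
  a = c(NA, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  b = c(1, NA, 3, 4, 5, 6, 7, 8, 9, 10),
  c = c(1, 2, NA, 4, 5, 6, 7, 8, 9, 10)
)

col_missing_rate <- colMeans(is.na(toy_df))        # share of NAs per column
row_missing_rate <- mean(!complete.cases(toy_df))  # share of rows with any NA

print(col_missing_rate)
print(row_missing_rate)
```

The same two expressions applied to starData would give the actual figures for this dataset.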
Instead of removing all the rows with missing values, we first dropped the four columns with the most missing values and then did two different things: we replaced missing numerical values with the column mean, and missing categorical values with the mode (the most frequent level).
starData<- starData[-c(4,7,31,32)]
levels(starData$koi_tce_delivname)
## [1] "" "q1_q16_tce" "q1_q17_dr24_tce" "q1_q17_dr25_tce"
dataA <- subset(starData,starData$koi_tce_delivname =="q1_q16_tce")
dataB <- subset(starData,starData$koi_tce_delivname =="q1_q17_dr24_tce")
dataC <- subset(starData,starData$koi_tce_delivname =="q1_q17_dr25_tce")
dim(dataA)[1]
## [1] 796
dim(dataB)[1]
## [1] 368
dim(dataC)[1]
## [1] 8054
starData$koi_tce_delivname[is.na(starData$koi_tce_delivname)] <- "q1_q17_dr25_tce"
starData$koi_tce_delivname <- factor(starData$koi_tce_delivname)
levels(starData$koi_tce_delivname)
## [1] "q1_q16_tce" "q1_q17_dr24_tce" "q1_q17_dr25_tce"
starData$koi_tce_plnt_num <- as.factor(starData$koi_tce_plnt_num)
levels(starData$koi_tce_plnt_num)
## [1] "1" "2" "3" "4" "5" "6" "7" "8"
Mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
Mode(starData$koi_tce_plnt_num)
## [1] 1
## Levels: 1 2 3 4 5 6 7 8
starData$koi_tce_plnt_num[is.na(starData$koi_tce_plnt_num)] <- "1"
# Replace missing values in the numeric columns with the column mean
# (columns 33 and 34 are the categorical columns handled above)
for (i in c(6:32, 35:46)) {
  starData[is.na(starData[,i]), i] <- mean(starData[,i], na.rm = TRUE)
}
# Count the rows that still contain a missing value
sum(!complete.cases(starData))
## [1] 0
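The imputation steps above (mean for numeric columns, most frequent level for categorical ones) can be condensed into one helper that works by column type rather than by column index. This is only a sketch: `impute_column` and `stat_mode` are hypothetical names, the toy data is invented, and unlike the Mode function above this version ignores NA when looking for the most frequent value.

```r
# Sketch: mean imputation for numeric columns, mode imputation for factors.
# `impute_column` and `stat_mode` are hypothetical helpers for illustration.
stat_mode <- function(x) {
  ux <- unique(x[!is.na(x)])
  ux[which.max(tabulate(match(x, ux)))]
}

impute_column <- function(x) {
  if (is.numeric(x)) {
    x[is.na(x)] <- mean(x, na.rm = TRUE)
  } else {
    x[is.na(x)] <- stat_mode(x)
  }
  x
}

toy_df <- data.frame(
  depth = c(100, NA, 300),
  deliv = factor(c("dr25", NA, "dr25"))
)
toy_df[] <- lapply(toy_df, impute_column)
print(toy_df)
```

Selecting columns by type avoids hard-coded index ranges, which silently break if the column order changes.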
starData <- starData[-c(1:3,5)]
starData$koi_fpflag_co <- as.factor(starData$koi_fpflag_co)
starData$koi_fpflag_ec <- as.factor(starData$koi_fpflag_ec)
starData$koi_fpflag_nt <- as.factor(starData$koi_fpflag_nt)
starData$koi_fpflag_ss <- as.factor(starData$koi_fpflag_ss)
starData$koi_tce_delivname <- as.factor(starData$koi_tce_delivname)
starData$koi_tce_plnt_num <- as.factor(starData$koi_tce_plnt_num)
Rearrange the dataset so that the categorical variables come first, followed by the numerical variables.
starData <- starData[c(1:5,29,30,6:28,31:42)]
head(starData)
## koi_disposition koi_fpflag_nt koi_fpflag_ss koi_fpflag_co koi_fpflag_ec
## 1 CONFIRMED 0 0 0 0
## 2 CONFIRMED 0 0 0 0
## 3 FALSE POSITIVE 0 1 0 0
## 4 FALSE POSITIVE 0 1 0 0
## 5 CONFIRMED 0 0 0 0
## 6 CONFIRMED 0 0 0 0
## koi_tce_plnt_num koi_tce_delivname koi_period koi_period_err1
## 1 1 q1_q17_dr25_tce 9.488036 2.775e-05
## 2 2 q1_q17_dr25_tce 54.418383 2.479e-04
## 3 1 q1_q17_dr25_tce 19.899140 1.494e-05
## 4 1 q1_q17_dr25_tce 1.736952 2.630e-07
## 5 1 q1_q17_dr25_tce 2.525592 3.761e-06
## 6 1 q1_q17_dr25_tce 11.094321 2.036e-05
## koi_period_err2 koi_time0bk koi_time0bk_err1 koi_time0bk_err2 koi_impact
## 1 -2.775e-05 170.5387 0.002160 -0.002160 0.146
## 2 -2.479e-04 162.5138 0.003520 -0.003520 0.586
## 3 -1.494e-05 175.8503 0.000581 -0.000581 0.969
## 4 -2.630e-07 170.3076 0.000115 -0.000115 1.276
## 5 -3.761e-06 171.5956 0.001130 -0.001130 0.701
## 6 -2.036e-05 171.2012 0.001410 -0.001410 0.538
## koi_impact_err1 koi_impact_err2 koi_duration koi_duration_err1
## 1 0.318 -0.146 2.95750 0.08190
## 2 0.059 -0.443 4.50700 0.11600
## 3 5.126 -0.077 1.78220 0.03410
## 4 0.115 -0.092 2.40641 0.00537
## 5 0.235 -0.478 1.65450 0.04200
## 6 0.030 -0.428 4.59450 0.06100
## koi_duration_err2 koi_depth koi_depth_err1 koi_depth_err2 koi_prad
## 1 -0.08190 615.8 19.5 -19.5 2.26
## 2 -0.11600 874.8 35.5 -35.5 2.83
## 3 -0.03410 10829.0 171.0 -171.0 14.60
## 4 -0.00537 8079.2 12.8 -12.8 33.46
## 5 -0.04200 603.3 16.9 -16.9 2.75
## 6 -0.06100 1517.5 24.2 -24.2 3.90
## koi_prad_err1 koi_prad_err2 koi_teq koi_insol koi_insol_err1
## 1 0.26 -0.15 793 93.59 29.45
## 2 0.32 -0.19 443 9.11 2.87
## 3 3.92 -1.31 638 39.30 31.04
## 4 8.50 -2.83 1395 891.96 668.95
## 5 0.88 -0.35 1406 926.16 874.33
## 6 1.27 -0.42 835 114.81 112.85
## koi_insol_err2 koi_model_snr koi_steff koi_steff_err1 koi_steff_err2
## 1 -16.65 35.8 5455 81 -81
## 2 -1.62 25.8 5455 81 -81
## 3 -10.49 76.3 5853 158 -176
## 4 -230.35 505.6 5805 157 -174
## 5 -314.24 40.9 6031 169 -211
## 6 -36.70 66.5 6046 189 -232
## koi_slogg koi_slogg_err1 koi_slogg_err2 koi_srad koi_srad_err1
## 1 4.467 0.064 -0.096 0.927 0.105
## 2 4.467 0.064 -0.096 0.927 0.105
## 3 4.544 0.044 -0.176 0.868 0.233
## 4 4.564 0.053 -0.168 0.791 0.201
## 5 4.438 0.070 -0.210 1.046 0.334
## 6 4.486 0.054 -0.229 0.972 0.315
## koi_srad_err2 ra dec koi_kepmag
## 1 -0.061 291.9342 48.14165 15.347
## 2 -0.061 291.9342 48.14165 15.347
## 3 -0.078 297.0048 48.13413 15.436
## 4 -0.067 285.5346 48.28521 15.597
## 5 -0.133 288.7549 48.22620 15.509
## 6 -0.105 296.2861 48.22467 15.714
num_samples = dim(starData)[1]
sampling.rate = 0.8
training <- sample(1:num_samples, sampling.rate * num_samples, replace=FALSE)
trainingSet <- subset(starData[training, ])
testing <- setdiff(1:num_samples,training)
testingSet <- subset(starData[testing, ])
names(trainingSet)
## [1] "koi_disposition" "koi_fpflag_nt" "koi_fpflag_ss"
## [4] "koi_fpflag_co" "koi_fpflag_ec" "koi_tce_plnt_num"
## [7] "koi_tce_delivname" "koi_period" "koi_period_err1"
## [10] "koi_period_err2" "koi_time0bk" "koi_time0bk_err1"
## [13] "koi_time0bk_err2" "koi_impact" "koi_impact_err1"
## [16] "koi_impact_err2" "koi_duration" "koi_duration_err1"
## [19] "koi_duration_err2" "koi_depth" "koi_depth_err1"
## [22] "koi_depth_err2" "koi_prad" "koi_prad_err1"
## [25] "koi_prad_err2" "koi_teq" "koi_insol"
## [28] "koi_insol_err1" "koi_insol_err2" "koi_model_snr"
## [31] "koi_steff" "koi_steff_err1" "koi_steff_err2"
## [34] "koi_slogg" "koi_slogg_err1" "koi_slogg_err2"
## [37] "koi_srad" "koi_srad_err1" "koi_srad_err2"
## [40] "ra" "dec" "koi_kepmag"
The koi_disposition variable has three categories, as mentioned above: “CANDIDATE”, “CONFIRMED”, and “FALSE POSITIVE”. Blank values are classified as “NOT DISPOSITIONED” and are ignored. These values reflect the historical dispositions reported in the literature for exoplanet candidates. KOI stands for Kepler “object of interest”, which covers the objects that Kepler has found. The objective of the model is to predict whether a KOI is a candidate, confirmed, or a false positive.
For this problem we applied the following models:
- Decision tree
- Random forest
- KNN
- SVM
- Neural network
Linear regression was set aside because there were many categorical variables. Logistic regression was set aside because koi_disposition has three categories, which makes it more complex to analyze.
decTreeModel <- rpart(koi_disposition ~ .,data=trainingSet,method = "class")
prp(decTreeModel)
plotcp(decTreeModel)
pruned_decTreeModel = prune(decTreeModel, cp=0.012)
prp(pruned_decTreeModel)
As shown above, the most important features in determining the classification of a KOI rank as follows (the plot truncates the labels; full names are given where the truncation is unambiguous):
1. koi_fpflag_ss
2. koi_fpflag_nt
3. koi_fpflag_co
4. koi_model_snr
5. koi_fpflag_ec
6. koi_prad_err (label truncated in the plot)
The decision tree makes it very easy to understand and visualize the important aspects of this problem. It was also very fast to implement, as it can handle both categorical and numerical data.
predictedLabels<-predict(pruned_decTreeModel, testingSet, type = "class")
sizeTestSet = dim(testingSet)[1]
error = sum(predictedLabels != testingSet$koi_disposition)
misclassification_rate = error/sizeTestSet
print(misclassification_rate)
## [1] 0.1474124
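A single misclassification rate does not show which classes are being confused with one another. As a small sketch with made-up labels (not the model's actual predictions), a confusion matrix built with `table()` breaks the same error count down by class.

```r
# Sketch with made-up labels: misclassification rate and confusion matrix.
classes   <- c("CANDIDATE", "CONFIRMED", "FALSE POSITIVE")
actual    <- factor(c("CONFIRMED", "CANDIDATE", "FALSE POSITIVE",
                      "CONFIRMED", "CANDIDATE", "FALSE POSITIVE"),
                    levels = classes)
predicted <- factor(c("CONFIRMED", "CONFIRMED", "FALSE POSITIVE",
                      "CONFIRMED", "CANDIDATE", "FALSE POSITIVE"),
                    levels = classes)

# Share of test observations whose predicted label differs from the truth
misclassification_rate <- mean(predicted != actual)

# Rows are predicted classes, columns are actual classes
confusion <- table(Predicted = predicted, Actual = actual)

print(misclassification_rate)
print(confusion)
```

With the real objects this would be `table(Predicted = predictedLabels, Actual = testingSet$koi_disposition)`.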
RandForestModel <- randomForest(koi_disposition ~ .,data=trainingSet)
plot(RandForestModel)
legend("top", colnames(RandForestModel$err.rate),fill=1:3)
predictedLabels<-predict(RandForestModel, testingSet)
sizeTestSet = dim(testingSet)[1]
error = sum(predictedLabels != testingSet$koi_disposition)
misclassification_rate = error/sizeTestSet
print(misclassification_rate)
## [1] 0.1092525
The random forest model had a lower misclassification rate than the decision tree. Decision trees are prone to overfitting; random forests mitigate overfitting and can lead to more accurate classification and prediction, as seen in this case.
Normalize all data
starData[8:42] <- scale(starData[8:42])
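The call above standardizes each column using the mean and standard deviation of the full dataset, before the train/test split. One alternative, sketched here on toy data and assuming the split is made first, is to compute the scaling statistics on the training rows only and reuse them for the test rows, so that no information from the test set enters the preprocessing.

```r
# Sketch on toy data: scale the test set with the training set's statistics.
set.seed(42)
toy <- data.frame(x1 = rnorm(100, mean = 50, sd = 10),
                  x2 = rnorm(100, mean = 5,  sd = 2))
train_idx <- sample(seq_len(nrow(toy)), size = 80)

# Centre and scale the training rows, keeping the statistics as attributes
train_scaled <- scale(toy[train_idx, ])

# Apply the training means and standard deviations to the held-out rows
test_scaled <- scale(toy[-train_idx, ],
                     center = attr(train_scaled, "scaled:center"),
                     scale  = attr(train_scaled, "scaled:scale"))

print(round(colMeans(train_scaled), 10))  # zero by construction
print(round(colMeans(test_scaled), 3))    # near zero, but not exactly
```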
change koi_tce_delivname into numerical values
levels(starData$koi_tce_delivname)
## [1] "q1_q16_tce" "q1_q17_dr24_tce" "q1_q17_dr25_tce"
# Rename the three levels in order; the column stays a factor with levels "1", "2", "3"
levels(starData$koi_tce_delivname) <- c("1", "2", "3")
levels(starData$koi_tce_delivname)
## [1] "1" "2" "3"
num_samples = dim(starData)[1]
sampling.rate = 0.8
training <- sample(1:num_samples, sampling.rate * num_samples, replace=FALSE)
trainingSet <- starData[training, ]
testing <- setdiff(1:num_samples,training)
testingSet <- starData[testing, ]
trainingfeatures <- subset(trainingSet, select=c(-koi_disposition))
traininglabels <- trainingSet$koi_disposition
testingfeatures <- subset(testingSet, select=c(-koi_disposition))
currentBestError = Inf
currentBestVar = -1
for(i in 1:30) {
predictedLabels = knn(trainingfeatures,testingfeatures,traininglabels,k=i)
error = sum(predictedLabels != testingSet$koi_disposition)
if(error < currentBestError){
print(paste0("We found a better k: ",i))
currentBestError = error
currentBestVar = i
}
}
## [1] "We found a better k: 1"
## [1] "We found a better k: 3"
## [1] "We found a better k: 6"
## [1] "We found a better k: 13"
currentBestVar
## [1] 13
currentBestError / (dim(testingSet)[1])
## [1] 0.2294825
AllErrors=c()
for(fold in 1:50)
{
# Get training and testing sets
num_samples = dim(starData)[1]
sampling.rate = 0.8
training <- sample(1:num_samples, sampling.rate * num_samples, replace=FALSE)
trainingSet <- starData[training, ]
testing <- setdiff(1:num_samples,training)
testingSet <- starData[testing, ]
trainingfeatures <- subset(trainingSet, select=c(-koi_disposition))
traininglabels <- trainingSet$koi_disposition
testingfeatures <- subset(testingSet, select=c(-koi_disposition))
predictedLabels = knn(trainingfeatures,testingfeatures,traininglabels,k=currentBestVar)
error = sum(predictedLabels != testingSet$koi_disposition)
errorRate <- error / (dim(testingSet)[1])
AllErrors[fold] = errorRate
}
AverageError = mean(AllErrors)
AverageError
## [1] 0.2261579
By repeating the random 80/20 split 50 times and averaging the KNN error rates, we obtain a more reliable estimate of the error than a single test split provides.
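A closely related scheme is k-fold cross-validation, in which every observation is held out exactly once. The sketch below is illustrative only: it uses the built-in iris data rather than the Kepler data, and knn_predict is a hypothetical base-R helper, not the knn() call used in this report. It also picks k from the cross-validated error rather than from a single test split.

```r
# Minimal base-R k-nearest-neighbours classifier (hypothetical helper).
knn_predict <- function(train_x, train_y, test_x, k) {
  train_t <- t(as.matrix(train_x))  # features in rows, one column per training point
  apply(as.matrix(test_x), 1, function(row) {
    d <- colSums((train_t - row)^2)               # squared Euclidean distances
    votes <- table(train_y[order(d)[seq_len(k)]]) # classes of the k nearest points
    names(votes)[which.max(votes)]                # majority class
  })
}

set.seed(1)
x <- scale(iris[, 1:4])  # z-score the features
y <- iris$Species
n_folds <- 5
# Assign every row to one of n_folds folds at random.
fold_id <- sample(rep(seq_len(n_folds), length.out = nrow(x)))

# Average misclassification rate over the folds for a given k.
cv_error <- function(k) {
  fold_errors <- sapply(seq_len(n_folds), function(f) {
    held_out <- fold_id == f
    pred <- knn_predict(x[!held_out, ], y[!held_out],
                        x[held_out, , drop = FALSE], k)
    mean(pred != as.character(y[held_out]))
  })
  mean(fold_errors)
}

k_values <- 1:15
errors <- sapply(k_values, cv_error)
best_k <- k_values[which.min(errors)]
print(data.frame(k = k_values, cv_error = errors))
print(best_k)
```

Choosing k on the cross-validated error and reporting the final error on data that played no part in that choice avoids an optimistic estimate.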
starData$koi_disposition <- as.factor(starData$koi_disposition)
levels(starData$koi_disposition)
## [1] "CANDIDATE" "CONFIRMED" "FALSE POSITIVE"
This model takes a long time to fit, so we call print(i) inside the loop to track progress. Ideally we would search a wider range of cost values, but to keep the run time manageable we use 15 to 20 to demonstrate the concept. Feel free to change the range of the for-loop during assessment.
currentBestError = Inf
currentBestVar = -1
for(i in 15:20) {
svmModel <- svm(koi_disposition~., data=trainingSet, kernel="linear", cost=i)
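# Score the fitted SVM itself. Without this predict() call the error is computed from the
# stale KNN predictions, which is why the recorded output is identical for every cost and kernel.
predictedLabels <- predict(svmModel, testingSet)
error = sum(predictedLabels != testingSet$koi_disposition)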
print(i)
if(error < currentBestError){
print(paste0("We found a better cost: ",i))
currentBestError = error
currentBestVar = i
}
}
## [1] 15
## [1] "We found a better cost: 15"
## [1] 16
## [1] 17
## [1] 18
## [1] 19
## [1] 20
currentBestVar
## [1] 15
currentBestError / (dim(testingSet)[1])
## [1] 0.2195504
currentBestError = Inf
currentBestVar = -1
for(i in 15:20) {
svmModel <- svm(koi_disposition~., data=trainingSet, kernel="polynomial", cost=i)
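# Score the fitted SVM itself. Without this predict() call the error is computed from the
# stale KNN predictions, which is why the recorded output is identical for every cost and kernel.
predictedLabels <- predict(svmModel, testingSet)
error = sum(predictedLabels != testingSet$koi_disposition)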
print(i)
if(error < currentBestError){
print(paste0("We found a better cost: ",i))
currentBestError = error
currentBestVar = i
}
}
## [1] 15
## [1] "We found a better cost: 15"
## [1] 16
## [1] 17
## [1] 18
## [1] 19
## [1] 20
currentBestVar
## [1] 15
currentBestError / (dim(testingSet)[1])
## [1] 0.2195504
currentBestError = Inf
currentBestVar = -1
for(i in 15:20) {
svmModel <- svm(koi_disposition~., data=trainingSet, kernel="radial", cost=i)
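# Score the fitted SVM itself. Without this predict() call the error is computed from the
# stale KNN predictions, which is why the recorded output is identical for every cost and kernel.
predictedLabels <- predict(svmModel, testingSet)
error = sum(predictedLabels != testingSet$koi_disposition)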
print(i)
if(error < currentBestError){
print(paste0("We found a better cost: ",i))
currentBestError = error
currentBestVar = i
}
}
## [1] 15
## [1] "We found a better cost: 15"
## [1] 16
## [1] 17
## [1] 18
## [1] 19
## [1] 20
currentBestVar
## [1] 15
currentBestError / (dim(testingSet)[1])
## [1] 0.2195504
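Using a more complex machine learning algorithm, a neural network, requires every column to be numeric, so we first inspect the data and convert the factors.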
head(starData)
## koi_disposition koi_fpflag_nt koi_fpflag_ss koi_fpflag_co koi_fpflag_ec
## 1 CONFIRMED 0 0 0 0
## 2 CONFIRMED 0 0 0 0
## 3 FALSE POSITIVE 0 1 0 0
## 4 FALSE POSITIVE 0 1 0 0
## 5 CONFIRMED 0 0 0 0
## 6 CONFIRMED 0 0 0 0
## koi_tce_plnt_num koi_tce_delivname koi_period koi_period_err1
## 1 1 3 -0.04958503 -0.2637455
## 2 2 3 -0.01592288 -0.2363585
## 3 1 3 -0.04178495 -0.2653391
## 4 1 3 -0.05539220 -0.2671649
## 5 1 3 -0.05480134 -0.2667298
## 6 1 3 -0.04838159 -0.2646648
## koi_period_err2 koi_time0bk koi_time0bk_err1 koi_time0bk_err2
## 1 0.2637455 0.06412788 -0.3447962 0.3447962
## 2 0.2363585 -0.05402631 -0.2844657 0.2844657
## 3 0.2653391 0.14233141 -0.4148417 0.4148417
## 4 0.2671649 0.06072405 -0.4355138 0.4355138
## 5 0.2667298 0.07968760 -0.3904877 0.3904877
## 6 0.2646648 0.07388083 -0.3780667 0.3780667
## koi_impact koi_impact_err1 koi_impact_err2 koi_duration
## 1 -0.17935061 -0.1785545 0.15294104 -0.4116641
## 2 -0.04539452 -0.2067211 -0.09054152 -0.1722317
## 3 0.07120817 0.3443219 0.20950769 -0.5932743
## 4 0.16467299 -0.2006311 0.19721059 -0.4968199
## 5 -0.01038326 -0.1875809 -0.11923475 -0.6130068
## 6 -0.06000791 -0.2098749 -0.07824442 -0.1587109
## koi_duration_err1 koi_duration_err2 koi_depth koi_depth_err1
## 1 -0.3947219 0.3947219 -0.2873000 -0.02583522
## 2 -0.3425597 0.3425597 -0.2840893 -0.02184898
## 3 -0.4678408 0.4678408 -0.1606901 0.01190950
## 4 -0.5117886 0.5117886 -0.1947785 -0.02750446
## 5 -0.4557563 0.4557563 -0.2874550 -0.02648299
## 6 -0.4266923 0.4266923 -0.2761219 -0.02466426
## koi_depth_err2 koi_prad koi_prad_err1 koi_prad_err2 koi_teq
## 1 0.02583522 -0.03333655 -0.04534862 0.02808129 -0.3481029
## 2 0.02184898 -0.03314772 -0.04519222 0.02804712 -0.7647988
## 3 -0.01190950 -0.02924864 -0.03580850 0.02709038 -0.5326397
## 4 0.02750446 -0.02300084 -0.02387032 0.02579196 0.3686142
## 5 0.02648299 -0.03317422 -0.04373253 0.02791044 0.3817104
## 6 0.02466426 -0.03279326 -0.04271596 0.02785064 -0.2980993
## koi_insol koi_insol_err1 koi_insol_err2 koi_model_snr koi_steff
## 1 -0.04889243 -0.06876874 0.04634332 -0.2870964 -0.3221945
## 2 -0.04943220 -0.06925994 0.04651629 -0.2999078 -0.3221945
## 3 -0.04923931 -0.06873936 0.04641421 -0.2352104 0.1870253
## 4 -0.04379134 -0.05695077 0.04388395 0.3147818 0.1256119
## 5 -0.04357283 -0.05315534 0.04291850 -0.2805626 0.4147669
## 6 -0.04875685 -0.06722751 0.04611257 -0.2477655 0.4339586
## koi_steff_err1 koi_steff_err2 koi_slogg koi_slogg_err1 koi_slogg_err2
## 1 -1.3868026 1.1464281 0.3696378 -0.4379769 0.5657535
## 2 -1.3868026 1.1464281 0.3696378 -0.4379769 0.5657535
## 3 0.2912499 -0.1937625 0.5511062 -0.5923628 -0.3939537
## 4 0.2694570 -0.1655480 0.5982408 -0.5228891 -0.2979830
## 5 0.5309717 -0.6875169 0.3012926 -0.3916611 -0.8018293
## 6 0.9668295 -0.9837696 0.4144157 -0.5151698 -1.0297598
## koi_srad koi_srad_err1 koi_srad_err2 ra dec koi_kepmag
## 1 -0.1334014 -0.28342173 0.1578657 -0.02641956 1.202701 0.7813000
## 2 -0.1334014 -0.28342173 0.1578657 -0.02641956 1.202701 0.7813000
## 3 -0.1432187 -0.14242259 0.1498259 1.03734270 1.200612 0.8455425
## 4 -0.1560312 -0.17767238 0.1550281 -1.36899986 1.242565 0.9617565
## 5 -0.1136003 -0.03116546 0.1238150 -0.69341739 1.226179 0.8982358
## 6 -0.1259136 -0.05209502 0.1370569 0.88656827 1.225754 1.0462101
levels(starData$koi_disposition)
## [1] "CANDIDATE" "CONFIRMED" "FALSE POSITIVE"
levels(starData$koi_disposition)[levels(starData$koi_disposition)=="CANDIDATE"] <- "1"
levels(starData$koi_disposition)[levels(starData$koi_disposition)== "CONFIRMED"] <- "2"
levels(starData$koi_disposition)[levels(starData$koi_disposition)== "FALSE POSITIVE"] <- "3"
starData$koi_disposition[starData$koi_disposition=="CANDIDATE"] <- "1"
starData$koi_disposition[starData$koi_disposition== "CONFIRMED"] <- "2"
starData$koi_disposition[starData$koi_disposition== "FALSE POSITIVE"] <- "3"
starData$koi_disposition <- as.numeric(starData$koi_disposition)
str(starData)
## 'data.frame': 9564 obs. of 42 variables:
## $ koi_disposition : num 2 2 3 3 2 2 2 2 3 2 ...
## $ koi_fpflag_nt : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ koi_fpflag_ss : Factor w/ 2 levels "0","1": 1 1 2 2 1 1 1 1 2 1 ...
## $ koi_fpflag_co : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 2 1 ...
## $ koi_fpflag_ec : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ koi_tce_plnt_num : Factor w/ 8 levels "1","2","3","4",..: 1 2 1 1 1 1 2 3 1 1 ...
## $ koi_tce_delivname: Factor w/ 3 levels "1","2","3": 3 3 3 3 3 3 3 3 3 3 ...
## $ koi_period : num -0.0496 -0.0159 -0.0418 -0.0554 -0.0548 ...
## $ koi_period_err1 : num -0.264 -0.236 -0.265 -0.267 -0.267 ...
## $ koi_period_err2 : num 0.264 0.236 0.265 0.267 0.267 ...
## $ koi_time0bk : num 0.0641 -0.054 0.1423 0.0607 0.0797 ...
## $ koi_time0bk_err1 : num -0.345 -0.284 -0.415 -0.436 -0.39 ...
## $ koi_time0bk_err2 : num 0.345 0.284 0.415 0.436 0.39 ...
## $ koi_impact : num -0.1794 -0.0454 0.0712 0.1647 -0.0104 ...
## $ koi_impact_err1 : num -0.179 -0.207 0.344 -0.201 -0.188 ...
## $ koi_impact_err2 : num 0.1529 -0.0905 0.2095 0.1972 -0.1192 ...
## $ koi_duration : num -0.412 -0.172 -0.593 -0.497 -0.613 ...
## $ koi_duration_err1: num -0.395 -0.343 -0.468 -0.512 -0.456 ...
## $ koi_duration_err2: num 0.395 0.343 0.468 0.512 0.456 ...
## $ koi_depth : num -0.287 -0.284 -0.161 -0.195 -0.287 ...
## $ koi_depth_err1 : num -0.0258 -0.0218 0.0119 -0.0275 -0.0265 ...
## $ koi_depth_err2 : num 0.0258 0.0218 -0.0119 0.0275 0.0265 ...
## $ koi_prad : num -0.0333 -0.0331 -0.0292 -0.023 -0.0332 ...
## $ koi_prad_err1 : num -0.0453 -0.0452 -0.0358 -0.0239 -0.0437 ...
## $ koi_prad_err2 : num 0.0281 0.028 0.0271 0.0258 0.0279 ...
## $ koi_teq : num -0.348 -0.765 -0.533 0.369 0.382 ...
## $ koi_insol : num -0.0489 -0.0494 -0.0492 -0.0438 -0.0436 ...
## $ koi_insol_err1 : num -0.0688 -0.0693 -0.0687 -0.057 -0.0532 ...
## $ koi_insol_err2 : num 0.0463 0.0465 0.0464 0.0439 0.0429 ...
## $ koi_model_snr : num -0.287 -0.3 -0.235 0.315 -0.281 ...
## $ koi_steff : num -0.322 -0.322 0.187 0.126 0.415 ...
## $ koi_steff_err1 : num -1.387 -1.387 0.291 0.269 0.531 ...
## $ koi_steff_err2 : num 1.146 1.146 -0.194 -0.166 -0.688 ...
## $ koi_slogg : num 0.37 0.37 0.551 0.598 0.301 ...
## $ koi_slogg_err1 : num -0.438 -0.438 -0.592 -0.523 -0.392 ...
## $ koi_slogg_err2 : num 0.566 0.566 -0.394 -0.298 -0.802 ...
## $ koi_srad : num -0.133 -0.133 -0.143 -0.156 -0.114 ...
## $ koi_srad_err1 : num -0.2834 -0.2834 -0.1424 -0.1777 -0.0312 ...
## $ koi_srad_err2 : num 0.158 0.158 0.15 0.155 0.124 ...
## $ ra : num -0.0264 -0.0264 1.0373 -1.369 -0.6934 ...
## $ dec : num 1.2 1.2 1.2 1.24 1.23 ...
## $ koi_kepmag : num 0.781 0.781 0.846 0.962 0.898 ...
starData$koi_fpflag_nt <- as.numeric(starData$koi_fpflag_nt)
starData$koi_fpflag_co <- as.numeric(starData$koi_fpflag_co)
starData$koi_fpflag_ec <- as.numeric(starData$koi_fpflag_ec)
starData$koi_fpflag_ss <- as.numeric(starData$koi_fpflag_ss)
starData$koi_tce_delivname <- as.numeric(starData$koi_tce_delivname)
starData$koi_tce_plnt_num <- as.numeric(starData$koi_tce_plnt_num)
str(starData)
## 'data.frame': 9564 obs. of 42 variables:
## $ koi_disposition : num 2 2 3 3 2 2 2 2 3 2 ...
## $ koi_fpflag_nt : num 1 1 1 1 1 1 1 1 1 1 ...
## $ koi_fpflag_ss : num 1 1 2 2 1 1 1 1 2 1 ...
## $ koi_fpflag_co : num 1 1 1 1 1 1 1 1 2 1 ...
## $ koi_fpflag_ec : num 1 1 1 1 1 1 1 1 1 1 ...
## $ koi_tce_plnt_num : num 1 2 1 1 1 1 2 3 1 1 ...
## $ koi_tce_delivname: num 3 3 3 3 3 3 3 3 3 3 ...
## $ koi_period : num -0.0496 -0.0159 -0.0418 -0.0554 -0.0548 ...
## $ koi_period_err1 : num -0.264 -0.236 -0.265 -0.267 -0.267 ...
## $ koi_period_err2 : num 0.264 0.236 0.265 0.267 0.267 ...
## $ koi_time0bk : num 0.0641 -0.054 0.1423 0.0607 0.0797 ...
## $ koi_time0bk_err1 : num -0.345 -0.284 -0.415 -0.436 -0.39 ...
## $ koi_time0bk_err2 : num 0.345 0.284 0.415 0.436 0.39 ...
## $ koi_impact : num -0.1794 -0.0454 0.0712 0.1647 -0.0104 ...
## $ koi_impact_err1 : num -0.179 -0.207 0.344 -0.201 -0.188 ...
## $ koi_impact_err2 : num 0.1529 -0.0905 0.2095 0.1972 -0.1192 ...
## $ koi_duration : num -0.412 -0.172 -0.593 -0.497 -0.613 ...
## $ koi_duration_err1: num -0.395 -0.343 -0.468 -0.512 -0.456 ...
## $ koi_duration_err2: num 0.395 0.343 0.468 0.512 0.456 ...
## $ koi_depth : num -0.287 -0.284 -0.161 -0.195 -0.287 ...
## $ koi_depth_err1 : num -0.0258 -0.0218 0.0119 -0.0275 -0.0265 ...
## $ koi_depth_err2 : num 0.0258 0.0218 -0.0119 0.0275 0.0265 ...
## $ koi_prad : num -0.0333 -0.0331 -0.0292 -0.023 -0.0332 ...
## $ koi_prad_err1 : num -0.0453 -0.0452 -0.0358 -0.0239 -0.0437 ...
## $ koi_prad_err2 : num 0.0281 0.028 0.0271 0.0258 0.0279 ...
## $ koi_teq : num -0.348 -0.765 -0.533 0.369 0.382 ...
## $ koi_insol : num -0.0489 -0.0494 -0.0492 -0.0438 -0.0436 ...
## $ koi_insol_err1 : num -0.0688 -0.0693 -0.0687 -0.057 -0.0532 ...
## $ koi_insol_err2 : num 0.0463 0.0465 0.0464 0.0439 0.0429 ...
## $ koi_model_snr : num -0.287 -0.3 -0.235 0.315 -0.281 ...
## $ koi_steff : num -0.322 -0.322 0.187 0.126 0.415 ...
## $ koi_steff_err1 : num -1.387 -1.387 0.291 0.269 0.531 ...
## $ koi_steff_err2 : num 1.146 1.146 -0.194 -0.166 -0.688 ...
## $ koi_slogg : num 0.37 0.37 0.551 0.598 0.301 ...
## $ koi_slogg_err1 : num -0.438 -0.438 -0.592 -0.523 -0.392 ...
## $ koi_slogg_err2 : num 0.566 0.566 -0.394 -0.298 -0.802 ...
## $ koi_srad : num -0.133 -0.133 -0.143 -0.156 -0.114 ...
## $ koi_srad_err1 : num -0.2834 -0.2834 -0.1424 -0.1777 -0.0312 ...
## $ koi_srad_err2 : num 0.158 0.158 0.15 0.155 0.124 ...
## $ ra : num -0.0264 -0.0264 1.0373 -1.369 -0.6934 ...
## $ dec : num 1.2 1.2 1.2 1.24 1.23 ...
## $ koi_kepmag : num 0.781 0.781 0.846 0.962 0.898 ...
num_samples = dim(starData)[1]
sampling.rate = 0.8
training <- sample(1:num_samples, sampling.rate * num_samples, replace=FALSE)
trainingSet <- subset(starData[training, ])
testing <- setdiff(1:num_samples,training)
testingSet <- subset(starData[testing, ])
n <- names(starData)
f <- as.formula(paste("koi_disposition ~", paste(n[!n %in% "koi_disposition"], collapse = " + ")))
f
## koi_disposition ~ koi_fpflag_nt + koi_fpflag_ss + koi_fpflag_co +
## koi_fpflag_ec + koi_tce_plnt_num + koi_tce_delivname + koi_period +
## koi_period_err1 + koi_period_err2 + koi_time0bk + koi_time0bk_err1 +
## koi_time0bk_err2 + koi_impact + koi_impact_err1 + koi_impact_err2 +
## koi_duration + koi_duration_err1 + koi_duration_err2 + koi_depth +
## koi_depth_err1 + koi_depth_err2 + koi_prad + koi_prad_err1 +
## koi_prad_err2 + koi_teq + koi_insol + koi_insol_err1 + koi_insol_err2 +
## koi_model_snr + koi_steff + koi_steff_err1 + koi_steff_err2 +
## koi_slogg + koi_slogg_err1 + koi_slogg_err2 + koi_srad +
## koi_srad_err1 + koi_srad_err2 + ra + dec + koi_kepmag
nnModel <- neuralnet(f, data=trainingSet, hidden=c(7,5,3), linear.output=FALSE)
plot(nnModel)
predictedLabels <-compute(nnModel, testingSet[,2:42])
predictedLabels<-round(predictedLabels$net.result)
sizeTestSet = dim(testingSet)[1]
error = sum(predictedLabels != testingSet$koi_disposition)
misclassification_rate = error/sizeTestSet
print(misclassification_rate)
## [1] 0.7705175118
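One likely contributor to this error rate is the label encoding. With linear.output=FALSE the network output passes through a squashing activation, so rounding it can only give 0 or 1, never the labels 2 or 3. A common alternative is one output unit per class (one-hot encoding), decoded by taking the largest score. The sketch below illustrates only the encoding and decoding steps in base R; the scores matrix is made up and stands in for network output, and no network is trained here.

```r
# Rounding a logistic output only ever yields 0 or 1.
raw_output <- c(-3, 0.2, 4)
squashed <- round(plogis(raw_output))
print(squashed)

# One-hot encode a three-class label (illustrative labels only).
labels <- factor(c("CANDIDATE", "CONFIRMED", "FALSE POSITIVE", "CONFIRMED"))
one_hot <- model.matrix(~ labels - 1)  # one indicator column per class
colnames(one_hot) <- levels(labels)
print(one_hot)

# Hypothetical per-class scores, one row per observation, one column per class.
scores <- matrix(c(0.7, 0.2, 0.1,
                   0.1, 0.8, 0.1,
                   0.2, 0.3, 0.5,
                   0.3, 0.6, 0.1), ncol = 3, byrow = TRUE)

# Decode: the predicted class is the column with the largest score.
decoded <- levels(labels)[max.col(scores, ties.method = "first")]
misclassification_rate <- mean(decoded != as.character(labels))
print(decoded)
print(misclassification_rate)
```

With this encoding the formula would have the three indicator columns on its left-hand side, and the misclassification rate is computed on the decoded class rather than on a rounded number.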
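After fitting all of the models, the most accurate was the random forest, which had the lowest misclassification rate (approximately 11%). The neural network's rate of about 77% is not a fair comparison: with linear.output=FALSE its output lies between 0 and 1, so a rounded prediction can never equal the labels 2 or 3.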
One of the more major investigations we conducted was for an unsupervised learning problem: Given the planet data available, could we use unsupervised learning methods to come up with some planet categorization scheme? And, assuming we have a viable scheme, does our categorization scheme match any of those which astronomers have already devised?
The tool we wanted to use was k-means clustering. We first needed to decide which planets to cluster, and settled on those with a koi_score of at least 0.8, because we wanted to be reasonably certain that the observations we included were planets. This still left us with a decent number of data points to work with.
# Verify there is an adequate volume of data after proposed koi_score filtering.
num_pts <- sum(kepler_df$koi_score >= 0.8 & !is.na(kepler_df$koi_score))
ggplot(kepler_df, aes(x = koi_score, fill = koi_pdisposition)) + geom_histogram() + geom_vline(xintercept=0.8, colour="orange", linetype = "longdash") + annotate("text", x = 0.7, y = 2000, label = "koi_score\ncutoff")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1510 rows containing non-finite values (stat_bin).
print(paste("The number of data points for koi_score >= 0.8:", num_pts))
## [1] "The number of data points for koi_score >= 0.8: 3682"
planets_df <- subset(kepler_df, koi_score >= 0.8)
head(planets_df)
## kepoi_name kepler_name koi_disposition koi_pdisposition koi_score
## 1 K00752.01 Kepler-227 b CONFIRMED CANDIDATE 1.000
## 2 K00752.02 Kepler-227 c CONFIRMED CANDIDATE 0.969
## 5 K00755.01 Kepler-664 b CONFIRMED CANDIDATE 1.000
## 6 K00756.01 Kepler-228 d CONFIRMED CANDIDATE 1.000
## 7 K00756.02 Kepler-228 c CONFIRMED CANDIDATE 1.000
## 8 K00756.03 Kepler-228 b CONFIRMED CANDIDATE 0.992
## koi_fpflag_nt koi_fpflag_ss koi_fpflag_co koi_fpflag_ec
## 1 0 No binary star detected 0 0
## 2 0 No binary star detected 0 0
## 5 0 No binary star detected 0 0
## 6 0 No binary star detected 0 0
## 7 0 No binary star detected 0 0
## 8 0 No binary star detected 0 0
## koi_period koi_period_err1 koi_period_err2 koi_time0bk
## 1 9.488035570 0.000027750 -0.000027750 170.53875
## 2 54.418382700 0.000247900 -0.000247900 162.51384
## 5 2.525591777 0.000003761 -0.000003761 171.59555
## 6 11.094320540 0.000020360 -0.000020360 171.20116
## 7 4.134435120 0.000010460 -0.000010460 172.97937
## 8 2.566588970 0.000017810 -0.000017810 179.55437
## koi_time0bk_err1 koi_time0bk_err2 koi_impact koi_impact_err1
## 1 0.00216 -0.00216 0.146 0.318
## 2 0.00352 -0.00352 0.586 0.059
## 5 0.00113 -0.00113 0.701 0.235
## 6 0.00141 -0.00141 0.538 0.030
## 7 0.00190 -0.00190 0.762 0.139
## 8 0.00461 -0.00461 0.755 0.212
## koi_impact_err2 koi_duration koi_duration_err1 koi_duration_err2
## 1 -0.146 2.9575 0.0819 -0.0819
## 2 -0.443 4.5070 0.1160 -0.1160
## 5 -0.478 1.6545 0.0420 -0.0420
## 6 -0.428 4.5945 0.0610 -0.0610
## 7 -0.532 3.1402 0.0673 -0.0673
## 8 -0.523 2.4290 0.1650 -0.1650
## koi_depth koi_depth_err1 koi_depth_err2 koi_prad koi_prad_err1
## 1 615.8 19.5 -19.5 2.26 0.26
## 2 874.8 35.5 -35.5 2.83 0.32
## 5 603.3 16.9 -16.9 2.75 0.88
## 6 1517.5 24.2 -24.2 3.90 1.27
## 7 686.0 18.7 -18.7 2.77 0.90
## 8 226.5 16.8 -16.8 1.59 0.52
## koi_prad_err2 koi_teq koi_insol koi_insol_err1 koi_insol_err2
## 1 -0.15 793 93.59 29.45 -16.65
## 2 -0.19 443 9.11 2.87 -1.62
## 5 -0.35 1406 926.16 874.33 -314.24
## 6 -0.42 835 114.81 112.85 -36.70
## 7 -0.30 1160 427.65 420.33 -136.70
## 8 -0.17 1360 807.74 793.91 -258.20
## koi_model_snr koi_tce_plnt_num koi_tce_delivname koi_steff
## 1 35.8 1 q1_q17_dr25_tce 5455
## 2 25.8 2 q1_q17_dr25_tce 5455
## 5 40.9 1 q1_q17_dr25_tce 6031
## 6 66.5 1 q1_q17_dr25_tce 6046
## 7 40.2 2 q1_q17_dr25_tce 6046
## 8 15.0 3 q1_q17_dr25_tce 6046
## koi_steff_err1 koi_steff_err2 koi_slogg koi_slogg_err1 koi_slogg_err2
## 1 81 -81 4.467 0.064 -0.096
## 2 81 -81 4.467 0.064 -0.096
## 5 169 -211 4.438 0.070 -0.210
## 6 189 -232 4.486 0.054 -0.229
## 7 189 -232 4.486 0.054 -0.229
## 8 189 -232 4.486 0.054 -0.229
## koi_srad koi_srad_err1 koi_srad_err2 ra dec koi_kepmag
## 1 0.927 0.105 -0.061 291.93423 48.141651 15.347
## 2 0.927 0.105 -0.061 291.93423 48.141651 15.347
## 5 1.046 0.334 -0.133 288.75488 48.226200 15.509
## 6 0.972 0.315 -0.105 296.28613 48.224670 15.714
## 7 0.972 0.315 -0.105 296.28613 48.224670 15.714
## 8 0.972 0.315 -0.105 296.28613 48.224670 15.714
## koi_els
## 1 Not Earth-like
## 2 Not Earth-like
## 5 Not Earth-like
## 6 Not Earth-like
## 7 Not Earth-like
## 8 Not Earth-like
str(planets_df)
## 'data.frame': 3682 obs. of 47 variables:
## $ kepoi_name : Factor w/ 9564 levels "K00001.01","K00002.01",..: 1081 1082 1085 1086 1087 1088 1089 1 2 11 ...
## $ kepler_name : Factor w/ 2294 levels "Kepler-1 b","Kepler-10 b",..: 1036 1037 1868 1040 1039 1038 1042 1 954 2031 ...
## $ koi_disposition : Factor w/ 3 levels "CANDIDATE","CONFIRMED",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ koi_pdisposition : Factor w/ 2 levels "CANDIDATE","FALSE POSITIVE": 1 1 1 1 1 1 1 1 1 1 ...
## $ koi_score : num 1 0.969 1 1 1 0.992 1 0.811 1 0.998 ...
## $ koi_fpflag_nt : int 0 0 0 0 0 0 0 0 0 0 ...
## $ koi_fpflag_ss : chr "No binary star detected" "No binary star detected" "No binary star detected" "No binary star detected" ...
## $ koi_fpflag_co : int 0 0 0 0 0 0 0 0 0 0 ...
## $ koi_fpflag_ec : int 0 0 0 0 0 0 0 0 0 0 ...
## $ koi_period : num 9.49 54.42 2.53 11.09 4.13 ...
## $ koi_period_err1 : num 0.00002775 0.0002479 0.00000376 0.00002036 0.00001046 ...
## $ koi_period_err2 : num -0.00002775 -0.0002479 -0.00000376 -0.00002036 -0.00001046 ...
## $ koi_time0bk : num 171 163 172 171 173 ...
## $ koi_time0bk_err1 : num 0.00216 0.00352 0.00113 0.00141 0.0019 0.00461 0.000517 0.0000087 0.000016 0.0000471 ...
## $ koi_time0bk_err2 : num -0.00216 -0.00352 -0.00113 -0.00141 -0.0019 -0.00461 -0.000517 -0.0000087 -0.000016 -0.0000471 ...
## $ koi_impact : num 0.146 0.586 0.701 0.538 0.762 0.755 0.052 0.818 0.224 0.631 ...
## $ koi_impact_err1 : num 0.318 0.059 0.235 0.03 0.139 0.212 0.262 0.001 0.159 0.007 ...
## $ koi_impact_err2 : num -0.146 -0.443 -0.478 -0.428 -0.532 -0.523 -0.052 -0.001 -0.216 -0.007 ...
## $ koi_duration : num 2.96 4.51 1.65 4.59 3.14 ...
## $ koi_duration_err1: num 0.0819 0.116 0.042 0.061 0.0673 0.165 0.0241 0.00107 0.00203 0.00653 ...
## $ koi_duration_err2: num -0.0819 -0.116 -0.042 -0.061 -0.0673 -0.165 -0.0241 -0.00107 -0.00203 -0.00653 ...
## $ koi_depth : num 616 875 603 1518 686 ...
## $ koi_depth_err1 : num 19.5 35.5 16.9 24.2 18.7 16.8 33.3 4.2 1.7 6.6 ...
## $ koi_depth_err2 : num -19.5 -35.5 -16.9 -24.2 -18.7 -16.8 -33.3 -4.2 -1.7 -6.6 ...
## $ koi_prad : num 2.26 2.83 2.75 3.9 2.77 ...
## $ koi_prad_err1 : num 0.26 0.32 0.88 1.27 0.9 0.52 0.22 0.51 0.81 1.11 ...
## $ koi_prad_err2 : num -0.15 -0.19 -0.35 -0.42 -0.3 -0.17 -0.49 -0.51 -0.91 -1.11 ...
## $ koi_teq : num 793 443 1406 835 1160 ...
## $ koi_insol : num 93.59 9.11 926.16 114.81 427.65 ...
## $ koi_insol_err1 : num 29.45 2.87 874.33 112.85 420.33 ...
## $ koi_insol_err2 : num -16.65 -1.62 -314.24 -36.7 -136.7 ...
## $ koi_model_snr : num 35.8 25.8 40.9 66.5 40.2 ...
## $ koi_tce_plnt_num : int 1 2 1 1 2 3 1 1 1 1 ...
## $ koi_tce_delivname: Factor w/ 3 levels "q1_q16_tce","q1_q17_dr24_tce",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ koi_steff : num 5455 5455 6031 6046 6046 ...
## $ koi_steff_err1 : num 81 81 169 189 189 189 75 78 76 112 ...
## $ koi_steff_err2 : num -81 -81 -211 -232 -232 -232 -83 -78 -89 -137 ...
## $ koi_slogg : num 4.47 4.47 4.44 4.49 4.49 ...
## $ koi_slogg_err1 : num 0.064 0.064 0.07 0.054 0.054 0.054 0.083 0.024 0.033 0.055 ...
## $ koi_slogg_err2 : num -0.096 -0.096 -0.21 -0.229 -0.229 -0.229 -0.028 -0.024 -0.027 -0.045 ...
## $ koi_srad : num 0.927 0.927 1.046 0.972 0.972 ...
## $ koi_srad_err1 : num 0.105 0.105 0.334 0.315 0.315 0.315 0.033 0.038 0.099 0.11 ...
## $ koi_srad_err2 : num -0.061 -0.061 -0.133 -0.105 -0.105 -0.105 -0.072 -0.038 -0.11 -0.11 ...
## $ ra : num 292 292 289 296 296 ...
## $ dec : num 48.1 48.1 48.2 48.2 48.2 ...
## $ koi_kepmag : num 15.3 15.3 15.5 15.7 15.7 ...
## $ koi_els : chr "Not Earth-like" "Not Earth-like" "Not Earth-like" "Not Earth-like" ...
For clustering we needed to narrow down the appropriate feature set. Three criteria were used to narrow down the feature set: 1. Ignore non-numeric features (since we’ll use Euclidean distances for clustering). 2. Ignore data that have no relation to the physical features of the planet. 3. Ignore redundant data.
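The first criterion, and part of the third, can be applied mechanically. The sketch below uses a small made-up data frame (toy is hypothetical, with values loosely based on the rows shown above) to keep numeric columns, drop the paired error columns, and remove incomplete rows. Deciding which features are physically meaningful remains a judgment call.

```r
# Hypothetical toy frame standing in for planets_df.
toy <- data.frame(
  kepoi_name      = c("K00752.01", "K00752.02", "K00755.01", "K00756.01"),
  koi_period      = c(9.49, 54.42, 2.53, 11.09),
  koi_period_err1 = c(2.775e-05, 2.479e-04, 3.761e-06, 2.036e-05),
  koi_teq         = c(793, NA, 1406, 835),
  stringsAsFactors = FALSE
)

# 1. Keep only numeric columns (Euclidean distance needs numbers).
numeric_cols <- names(toy)[sapply(toy, is.numeric)]

# 2. Drop the *_err1 / *_err2 columns, which describe measurement
#    uncertainty rather than the planet itself.
feature_cols <- numeric_cols[!grepl("_err[12]$", numeric_cols)]

# 3. Remove rows with a missing value in any retained feature.
features <- toy[complete.cases(toy[, feature_cols]), feature_cols]
print(features)

# A correlation matrix can help spot redundant features.
print(cor(features))
```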
# Vector of features to include
keep <- c("koi_period","koi_impact","koi_duration","koi_depth","koi_prad","koi_teq")
# Create dataframe for k-means
plnts_clst_df <- planets_df[,keep]
plnts_clst_df$koi_teq <- as.numeric(plnts_clst_df$koi_teq)
# Remove "NA" rows
plnts_clst_df <- subset(plnts_clst_df, !is.na(koi_teq))
head(plnts_clst_df)
## koi_period koi_impact koi_duration koi_depth koi_prad koi_teq
## 1 9.488035570 0.146 2.9575 615.8 2.26 793
## 2 54.418382700 0.586 4.5070 874.8 2.83 443
## 5 2.525591777 0.701 1.6545 603.3 2.75 1406
## 6 11.094320540 0.538 4.5945 1517.5 3.90 835
## 7 4.134435120 0.762 3.1402 686.0 2.77 1160
## 8 2.566588970 0.755 2.4290 226.5 1.59 1360
str(plnts_clst_df)
## 'data.frame': 3679 obs. of 6 variables:
## $ koi_period : num 9.49 54.42 2.53 11.09 4.13 ...
## $ koi_impact : num 0.146 0.586 0.701 0.538 0.762 0.755 0.052 0.818 0.224 0.631 ...
## $ koi_duration: num 2.96 4.51 1.65 4.59 3.14 ...
## $ koi_depth : num 616 875 603 1518 686 ...
## $ koi_prad : num 2.26 2.83 2.75 3.9 2.77 ...
## $ koi_teq : num 793 443 1406 835 1160 ...
Feature normalization:
# Normalize using z scores:
norm_plnts_clst_df <- plnts_clst_df
for (i in 1:dim(norm_plnts_clst_df)[2]) {
norm_plnts_clst_df[,i] <- (norm_plnts_clst_df[,i]-mean(norm_plnts_clst_df[,i]))/sd(norm_plnts_clst_df[,i])
}
head(norm_plnts_clst_df)
## koi_period koi_impact koi_duration koi_depth koi_prad
## 1 -0.3609254105 -0.8825612563 -0.4096494530 -0.13863335277 -0.13065565704
## 2 0.5586232648 0.3880442541 0.1128989621 -0.06735913965 -0.07971832878
## 5 -0.5034194345 0.7201343307 -0.8490689980 -0.14207322792 -0.08686742748
## 6 -0.3280510324 0.2494327439 0.1424071817 0.10950548110 0.01590086638
## 7 -0.4704926966 0.8962864583 -0.3480362904 -0.11931501392 -0.08508015280
## 8 -0.5025803822 0.8760722798 -0.5878790996 -0.24576482446 -0.19052935868
## koi_teq
## 1 -0.1861045735
## 2 -0.8533287157
## 5 0.9824908527
## 6 -0.1060376764
## 7 0.5135275985
## 8 0.8947985369
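The z-score loop above is equivalent to base R's scale() with its default centering and scaling. A minimal sketch on a few illustrative values (the data frame df is made up for the example) confirms that the two approaches agree:

```r
# Small illustrative data frame; column names mirror the report's features.
df <- data.frame(koi_period = c(9.49, 54.42, 2.53, 11.09),
                 koi_teq    = c(793, 443, 1406, 835))

# Manual z-scores, column by column.
manual <- df
for (i in seq_along(manual)) {
  manual[, i] <- (manual[, i] - mean(manual[, i])) / sd(manual[, i])
}

# The same transformation with scale(), which returns a matrix.
scaled <- as.data.frame(scale(df))

print(all.equal(manual, scaled, check.attributes = FALSE))
print(round(colMeans(scaled), 10))  # each column is centred on 0
print(sapply(scaled, sd))           # and has unit standard deviation
```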
totalWithnss = c()
betweenss = c()
withinss <- c()
# Sweep the number of clusters; centroids are selected randomly, so repeated runs can differ
for(clusters in 2:80)
{
fit <- kmeans(norm_plnts_clst_df, clusters)
totalWithnss[clusters] <- fit$tot.withinss
betweenss[clusters] <- fit$betweenss
}
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
plot(totalWithnss)
plot(betweenss)
plot(totalWithnss/betweenss)
The “totalWithnss”, “betweenss”, and “totalWithnss/betweenss” plots seemed to suggest that k = 20 was a good choice for the number of starting centroids and, therefore, the number of categories present.
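Because k-means starts from random centroids, a single fit per k can make these curves noisy, and the default of 10 iterations can trigger convergence warnings. The sketch below shows one way to stabilise the sweep using the nstart and iter.max arguments of kmeans(). It runs on the built-in iris data purely for illustration, and the range of k is an arbitrary choice.

```r
set.seed(42)
x <- scale(iris[, 1:4])  # z-scored features (illustrative data)

k_values <- 2:10
tot_within <- sapply(k_values, function(k) {
  # nstart = 20 keeps the best of 20 random initialisations;
  # iter.max = 50 gives each run more iterations to converge.
  kmeans(x, centers = k, nstart = 20, iter.max = 50)$tot.withinss
})

# Relative improvement gained by adding one more cluster.
relative_drop <- -diff(tot_within) / head(tot_within, -1)
print(data.frame(k = k_values, tot_within = tot_within))
print(round(relative_drop, 3))

# Elbow plot: look for the k after which the curve flattens.
plot(k_values, tot_within, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

The elbow is still read by eye, so the choice of k stays a judgment call; the extra restarts only make the curve less dependent on a lucky or unlucky initialisation.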
# Apply the appropriate number of categories.
fit <- kmeans(norm_plnts_clst_df, 20)
plnts_clst_df$Category <- fit$cluster
norm_plnts_clst_df$Category <- fit$cluster
plnts_clst_df$Category <- as.factor(plnts_clst_df$Category)
norm_plnts_clst_df$Category <- as.factor(norm_plnts_clst_df$Category)
Now that all observations were associated with one of twenty categories, according to k-means, we tried to understand which features are responsible for the most differentiation between the categories present.
# Create a data frame that excludes the categories (only the first 50 rows are used)
pca_plnts_clst_df <- plnts_clst_df[1:50,1:6]
pca_model <- prcomp(pca_plnts_clst_df, center = TRUE, scale. = TRUE)
biplot(pca_model)
The principal component biplot suggested that koi_teq and koi_period were very useful for describing differentiation between categories. We also saw that koi_prad, koi_impact, and koi_depth were vectors that caused separation of clusters in essentially the same direction, so perhaps only one of them was needed. The same could be said of koi_duration and koi_period.
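A biplot can be backed up with numbers: the share of variance each component explains, and the loadings that define the arrows. The sketch below is a minimal illustration on the built-in USArrests data, not the Kepler features, using only prcomp() output.

```r
# PCA on a built-in data set, centred and scaled as in the report.
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Proportion of total variance explained by each principal component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))

# Loadings: each column is a unit vector giving the weight of every
# original feature in that component (the arrow directions in a biplot).
loadings <- pca$rotation[, 1:2]
print(round(loadings, 3))

# Features whose loadings on PC1 and PC2 are nearly identical point the
# same way in the biplot and carry largely redundant information.
```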
To bolster this analysis (i.e. finding the features most responsible for cluster separation), we used a decision tree to reveal the most important features that determined accurate categorization.
We’ll use the “plnts_clst_df” data frame to build the decision tree.
dt_model <- rpart(data = plnts_clst_df, Category ~.)
plotcp(dt_model)
prp(dt_model)
pruned_dt_model <- prune(dt_model, cp = 0.023)
prp(pruned_dt_model)
The tree suggested that koi_impact, koi_teq, and koi_duration were the most important features. These results seemed to agree with the principal component analysis above.
We then created histograms that showed the distributions of categories across all the features taken for the planets.
ggplot(plnts_clst_df, aes(x = koi_teq, fill = Category)) + geom_histogram(binwidth = 50) + xlim(c(0,4000))
## Warning: Removed 8 rows containing non-finite values (stat_bin).
ggplot(plnts_clst_df, aes(x = koi_impact, fill = Category)) + geom_histogram(binwidth = 0.01)
ggplot(plnts_clst_df, aes(x = koi_prad, fill = Category)) + geom_histogram(binwidth = 0.2) + xlim(c(0,10))
## Warning: Removed 209 rows containing non-finite values (stat_bin).
ggplot(plnts_clst_df, aes(x = koi_depth, fill = Category)) + geom_histogram(binwidth = 20) + xlim(c(0,5000))
## Warning: Removed 151 rows containing non-finite values (stat_bin).
ggplot(plnts_clst_df, aes(x = koi_duration, fill = Category)) + geom_histogram(binwidth = 0.2) + xlim(c(0,20))
## Warning: Removed 14 rows containing non-finite values (stat_bin).
ggplot(plnts_clst_df, aes(x = koi_period, fill = Category)) + geom_histogram(binwidth = 1) + xlim(c(0,100))
## Warning: Removed 231 rows containing non-finite values (stat_bin).
Scanning the histograms of koi_teq, koi_impact, and koi_duration showed certain ranges in which some categories were prevalent and others were not. That is to say, there were bounds of separation between categories within these features, which seemed to be a good foundation for a feature space.
We created some scatter plots to enhance the visual analysis.
ggplot(plnts_clst_df, aes(x = koi_impact, y = koi_teq, color = Category)) + geom_point() + ylim(c(0,2000)) + xlim(c(0,1.5))
## Warning: Removed 91 rows containing missing values (geom_point).
ggplot(plnts_clst_df, aes(x = koi_impact, y = koi_duration, color = Category)) + geom_point() + ylim(c(0,10)) + xlim(c(0,1.5))
## Warning: Removed 171 rows containing missing values (geom_point).
ggplot(plnts_clst_df, aes(x = koi_teq, y = koi_duration, color = Category)) + geom_point() + ylim(c(0,10)) + xlim(c(0,4000))
## Warning: Removed 178 rows containing missing values (geom_point).
Since we had three principal features constituting the feature space, we tried a 3D plot.
plot_3d_plnts_clst_df <- subset(plnts_clst_df, koi_teq <= 2000 & koi_impact >= 0 & koi_impact <= 1.5 & koi_duration <= 20)
plot_ly(plot_3d_plnts_clst_df, x = ~koi_impact, y = ~koi_teq, z = ~koi_duration, color = ~Category) %>%
add_markers() %>%
layout(scene = list(xaxis = list(title = 'Sky-Projected Distance'),
yaxis = list(title = 'Equilibrium Temperature (K)'),
zaxis = list(title = 'Duration of Transit (Hours)')))
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
The three-dimensional plot showed the clusters much more clearly than any of the 2D feature spaces built above. Although the clustering was not perfect, because we chose to ignore several features, the 3D plot did confirm the usefulness of koi_impact, koi_teq, and koi_duration for explaining most of the clustering between points.
The final step in this clustering analysis was to segment the categorized planet dataset according to categories, then look at the feature distributions for koi_impact, koi_teq, and koi_duration for each category.
# Subset dataframe on a categorical basis
#for (i in 1:20) {
# temp_df2 <- subset(plnts_clst_df, Category == i)
# for (j in 1:(dim(plnts_clst_df)[2]-1)) {
# result.mean <- mean(temp_df2[,names(temp_df2)[j]], na.rm = TRUE)
# result.median <- median(temp_df2[,names(temp_df2)[j]], na.rm = TRUE)
# result.sd <- sd(temp_df2[,names(temp_df2)[j]], na.rm = TRUE)
# result.max <- max(temp_df2[,names(temp_df2)[j]], na.rm = TRUE)
# result.min <- min(temp_df2[,names(temp_df2)[j]], na.rm = TRUE)
# Print the results
# print(paste("The results for",names(plnts_clst_df)[j],"of category",i,"are..."))
# print(paste("The mean:",result.mean))
# print(paste("The median:",result.median))
# print(paste("The std. dev.:",result.sd))
# print(paste("The max:",result.max))
# print(paste("The min:",result.min))
# print("* * *")
# }
#}
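The per-category summaries described above can also be computed without nested loops. The sketch below uses aggregate() on a small made-up data frame; the Category assignments in toy are hypothetical and are not the clusters found by k-means.

```r
# Hypothetical toy data: two features and an invented category label.
toy <- data.frame(
  koi_teq    = c(793, 443, 1406, 835, 1160, 1360),
  koi_impact = c(0.146, 0.586, 0.701, 0.538, 0.762, 0.755),
  Category   = factor(c(1, 2, 1, 2, 1, 2))
)

# One row per category, one column per feature.
category_means <- aggregate(. ~ Category, data = toy, FUN = mean)
category_sds   <- aggregate(. ~ Category, data = toy, FUN = sd)
category_max   <- aggregate(. ~ Category, data = toy, FUN = max)

print(category_means)
print(category_sds)
print(category_max)
```

The same call with FUN = median or FUN = min gives the remaining statistics, and the resulting tables are easier to compare across categories than printed text.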
We found it helpful to create the following plots.
for (i in 1:(dim(plnts_clst_df)[2]-1)){
# Index with [[i]] so that aes() receives a numeric vector rather than a one-column data frame
p <- ggplot(plnts_clst_df, aes(x = plnts_clst_df[[i]], y = Category, color = Category)) + geom_point() + xlab(names(plnts_clst_df)[i]) + theme(legend.position="none")
print(p)
}
It was difficult to comment on the categorization scheme developed via k-means without thorough research into the features themselves and how they may relate to planet characteristics not explicitly found in the data. We should also note that we did not understand one of the principal features, koi_impact, very well, so what this measure reveals about more tangible planet characteristics is still unclear to us.
We’ll conclude the unsupervised learning and analysis here and leave the in-depth research on the categories above for a later time.
After conducting EDA, supervised and unsupervised learning on the Kepler Dataset, we were able to learn a lot about this unfamiliar topic and formulate many conclusions. To reiterate, the questions we asked and key takeaways are shown below:
When conducting the exploratory data analysis, we learned from Q1 that binary stars have a much smaller proportion of likely planets orbiting them than single stars do, and this is seen in both the Kepler analysis labels and the literature labels. From Q2, it appears that all the Earth-like planets in the dataset are considerably warmer than the Earth. From Q3, we noticed that sky-projected distance is essentially uncorrelated with any other feature of interest describing either the planet or the host star. From Q4, we realized that most of the stars that host Earth-like planets seem to be smaller and cooler than our Sun and to have larger surface accelerations. Finally, we noticed that the observations corresponding to Earth-like planets are spread out across the patches of celestial coordinates observed by Kepler.
For the supervised learning question, we fit the following models and calculated the misclassification rate of each:
- Decision Tree: 12%
- Random Forest: 10%
- KNN: 23%
- SVM: 22%
- Neural Network: 78%
The Random Forest model was clearly the most accurate, with the lowest misclassification rate at 10%. From the decision tree, we noticed that the most important factors for determining the classification of a planet were the false positive flags. Hopefully, with a misclassification rate of only 10%, this model could save NASA time and resources when confirming planets.
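For reference, a misclassification rate like the ones quoted above can be computed from a confusion matrix as one minus the fraction of observations on its diagonal. The sketch below uses ten hypothetical labels that are not the project's results; the names actual, predicted, and conf_mat are illustrative only.

```r
# Hypothetical true and predicted dispositions (NOT the project's results).
disposition_levels <- c("CANDIDATE", "CONFIRMED", "FALSE POSITIVE")
actual <- factor(c("CONFIRMED", "CONFIRMED", "FALSE POSITIVE", "CANDIDATE",
                   "FALSE POSITIVE", "CANDIDATE", "CONFIRMED", "FALSE POSITIVE",
                   "CANDIDATE", "CONFIRMED"),
                 levels = disposition_levels)
predicted <- factor(c("CONFIRMED", "CANDIDATE", "FALSE POSITIVE", "CANDIDATE",
                      "FALSE POSITIVE", "CONFIRMED", "CONFIRMED", "FALSE POSITIVE",
                      "CANDIDATE", "CONFIRMED"),
                    levels = disposition_levels)

# Rows are predictions, columns are true labels; the diagonal holds correct calls.
conf_mat <- table(Predicted = predicted, Actual = actual)
print(conf_mat)

# Misclassification rate = 1 - (correct predictions / all predictions).
misclass_rate <- 1 - sum(diag(conf_mat)) / sum(conf_mat)
print(misclass_rate)
```

In this toy data two of the ten predictions are wrong, so the rate comes out at 20%. Fixing the factor levels up front keeps the table square even if a model never predicts one of the classes.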
Through various means of clustering and analysis, three features stood out when categorizing planets: koi_impact, koi_teq, and koi_duration. These variables were very useful for differentiating between categories. One major challenge was our unfamiliarity with the topic, which limited our analysis, especially in unsupervised learning. Without further research on planetary features, it was hard to draw conclusions. This is a key focus moving forward in order to enhance our unsupervised learning analysis.
It was a great challenge to work with a dataset without any prior knowledge of the characteristics and features of a planet. However, through the power of data science, we were able to discover hidden relationships and identify the most important factors when answering our supervised and unsupervised learning questions. Moving forward, in order to improve our models, we will focus on learning more about each feature and on talking to an expert in this field, so that we can enhance data quality, recognize errors in the data, improve our analysis, and discover more relevant relationships and planet characteristics in unsupervised learning.
In addition, when we were predicting a KOI's disposition, performing a logistic regression was complicated by the fact that the response has three categories. This is a model that will need to be explored and completed in the future.
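One possible route for a three-category response is multinomial logistic regression, for example via multinom() from the nnet package. The sketch below is not the project's model: it assumes the built-in iris data as a stand-in, with Species playing the role of the three-level disposition, and the 100/50 split is an arbitrary choice.

```r
library(nnet)  # recommended package that ships with standard R distributions

# iris is a stand-in for the Kepler data: Species plays the role of the
# three-level disposition, and the four measurements play the role of features.
set.seed(42)
train_idx <- sample(nrow(iris), size = 100)
train_df <- iris[train_idx, ]
test_df  <- iris[-train_idx, ]

# multinom() fits one set of coefficients per non-reference class.
multi_fit <- multinom(Species ~ ., data = train_df, trace = FALSE)

# Predict the most likely class for held-out rows and score the model.
multi_pred <- predict(multi_fit, newdata = test_df, type = "class")
multi_misclass <- mean(multi_pred != test_df$Species)
print(multi_misclass)
```

For the Kepler data, the formula would instead name the disposition column as the response, and rows with missing predictor values would need to be handled before fitting.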
We also faced problems with neural networks. We supplied too many inputs to the network, which led to many unnecessary layers that confused the system and produced an incredibly high misclassification rate. The network will need to be simplified and explored further in order to take advantage of the capabilities of neural networks.
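A simpler starting point might be a single small hidden layer trained on standardized inputs. The sketch below is an illustration under assumptions rather than a fix for the network described above: it uses nnet() on the built-in iris data as a stand-in, and the layer size, decay, and iteration limit are arbitrary choices.

```r
library(nnet)  # recommended package that ships with standard R distributions

set.seed(42)
train_idx <- sample(nrow(iris), size = 100)
feature_cols <- setdiff(names(iris), "Species")

# Standardize inputs using training-set means and standard deviations only,
# then apply the same centering and scaling to the held-out rows.
train_x <- scale(iris[train_idx, feature_cols])
test_x  <- scale(iris[-train_idx, feature_cols],
                 center = attr(train_x, "scaled:center"),
                 scale  = attr(train_x, "scaled:scale"))

train_df <- data.frame(train_x, Species = iris$Species[train_idx])
test_df  <- data.frame(test_x,  Species = iris$Species[-train_idx])

# One hidden layer with 3 units; decay adds weight regularization.
small_net <- nnet(Species ~ ., data = train_df, size = 3, decay = 0.01,
                  maxit = 500, trace = FALSE)

# type = "class" returns the predicted class labels as a character vector.
net_pred <- predict(small_net, newdata = test_df, type = "class")
net_misclass <- mean(net_pred != test_df$Species)
print(net_misclass)
```

Starting small and adding inputs or hidden units only when the held-out misclassification rate improves makes it easier to see which change helps.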